Introduction

Wrong Place? Wrong Time? Or just plain wrong?

The expressive power of the Text Encoding Initiative Guidelines poses significant implementation challenges for large database systems.

Groping my way to a possible general solution that can be implemented on large databases.

A text object model: related SQL tables manage object hierarchies, in conjunction with a full-text indexing system that is also aware of those hierarchies, for searching and retrieval.

This talk:

Caveat emptor: it all works, in one version or another, but it is not yet a coherent system.


Olsen: Words, Objects and Attributes
Problematic

A paradoxical problem: the expressive power of the TEI.

Ability to define almost any type of textual object in almost any hierarchical configuration with almost any set of attributes.

Expression of TEI in XML: supports wide array of XML tools for document management, navigation, display.

Not clear that these tools will scale to handle large databases. Existing systems are expensive and not aimed at the academic environment.

Object model (in PhiloLogic): handles most problems so far encountered, and may be logically extended as a model to handle more.


SQL and Text Objects

A KISS rule: all data that is not actually textual may be extracted and managed in hierarchy-related SQL tables, corresponding to embedded textual objects.

Textual objects may range from the individual word to databases containing many collections.

Object management is handled by one or more dedicated subsystems configured to use the full power of SQL, including arithmetic, logical, and pattern matching operations.

Marriage of full text and SQL has the advantage of using two distinct components that are robust, fast, and scalable.

PhiloLogic word --> object indexing scheme has been used extensively in many document types and language groups.


Text = Objects and Attributes

A textual object, in my understanding of SGML/XML, is an arbitrary block of data that may contain additional objects and words or that may be a constituent part of an object.

Typically considered as a tree, but one may express more than one hierarchy of objects, e.g. "book-chapter-verse" and "pages".

So far, words are the lowest (atomic) level of object.

A word index entry is then a logical address: the path of nested objects leading down to the word object, e.g.:

Volume-->
     Book-->
        Chapter-->
           SubChapter-->
              paragraph-->
                  sentence-->
                       phrase-->
                                word
For indexing purposes, words are simply given logical addresses without reference to object type or any other data.

Implied object divisions: calculate phrases, sentences, etc. from context if explicit tagging is unavailable.
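A minimal sketch of implied object divisions, assuming sentences can be approximated by splitting on terminal punctuation. The function name `index_words` and the three-level address (paragraph, sentence, word position) are illustrative choices, not PhiloLogic's actual loader or address depth.

```python
import re


def index_words(paragraphs):
    """Yield (word, (paragraph, sentence, position)) for every word.

    Sentence boundaries are implied from context: this sketch assumes a
    sentence ends at '.', '!' or '?'. Real texts need better rules.
    """
    for p, text in enumerate(paragraphs):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        for s, sentence in enumerate(sentences):
            for w, word in enumerate(re.findall(r"\w+", sentence)):
                # The address records position only, not object type.
                yield word.lower(), (p, s, w)


entries = list(index_words(["Desire, delight. Taste and relish!",
                            "To take delight."]))
for word, address in entries:
    print(word, ":".join(str(part) for part in address))
```

Each word receives only a positional address; whether the second level is called a sentence or something else is left to whatever describes the objects.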


Text Objects in Philo

Text objects are assigned unique identifiers reflecting the object hierarchy, from document/file (not always the same) to word.

These can be used to

The paragraph in Diderot's Encyclopédie, for example, containing the phrase étincelles lumineuses has the logical address:
35:78:0:51
being the 51st child object of the 78th division object, of the 35th file/document.

Descriptive data associated with this object is stored in an SQL table. The enclosing division object, 35:78, is a main article, Electricité, by d'Aumont, in the class of knowledge Physique, and is associated with the page object 35:43, the 43rd page object of the 35th file (not the page number, 5:469, which is stored in a related table).

Object attributes are stored in SQL tables and can have different values at different levels.
Headword    | Type | Author     | Class of Knowledge                            | P.S. | Vol:Obj | Obj ID
LUNE        | artm | d'Alembert | Astr.                                         | s.f. | 9:3472  | 70:12:0
Lune        | arts | XXX        | Chimie.                                       | NA   | 9:3473  | 70:12:1
Lune        | arts | Venel      | Chimie.                                       | NA   | 9:3474  | 70:12:2
Lune        | arts | XXX        | Hist. nat. Chimie, Metallurgie a Mineralogie. | NA   | 9:3475  | 70:12:3
Lune cornée | arts | XXX        | Chimie Metall.                                | NA   | 9:3476  | 70:12:4
Lune        | arts | Jaucourt   | Mythologie.                                   | NA   | 9:3477  | 70:12:5
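As a rough illustration of attributes stored per object, the rows above can be loaded into an SQL table keyed by object ID. This sketch uses Python's built-in sqlite3 module; the table name `objects`, the column names, and the helper `objects_in_class` are assumptions made for the example, not PhiloLogic's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    " obj_id TEXT PRIMARY KEY, headword TEXT, obj_type TEXT,"
    " author TEXT, knowledge_class TEXT)"
)
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
    [
        ("70:12:0", "LUNE", "artm", "d'Alembert", "Astr."),
        ("70:12:1", "Lune", "arts", "XXX", "Chimie."),
        ("70:12:2", "Lune", "arts", "Venel", "Chimie."),
        ("70:12:3", "Lune", "arts", "XXX",
         "Hist. nat. Chimie, Metallurgie a Mineralogie."),
        ("70:12:4", "Lune cornée", "arts", "XXX", "Chimie Metall."),
        ("70:12:5", "Lune", "arts", "Jaucourt", "Mythologie."),
    ],
)


def objects_in_class(division, class_prefix):
    """Return (obj_id, author) for objects nested under one division."""
    cur = conn.execute(
        "SELECT obj_id, author FROM objects"
        " WHERE obj_id LIKE ? AND knowledge_class LIKE ?"
        " ORDER BY obj_id",
        (division + ":%", class_prefix + "%"),
    )
    return cur.fetchall()


chem = objects_in_class("70:12", "chimie")
print(chem)
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so the lowercase prefix matches both `Chimie.` and `Chimie Metall.` while the main article keeps its own author and class.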


Word Objects in PhiloLogic

Distinction: word index entry and word object database.

Internal raw index (from Platts' A Dictionary of Urdu, Classical Hindi, and English):

word Desire 0 212 0 1 1 0 62070 6
word delight 0 212 0 1 1 1 62078 6
word pleasure 0 212 0 1 1 2 62087 6
word taste 0 212 0 1 1 3 62097 6
word relish 0 212 0 1 1 4 62104 6
showing the object hierarchy to the word, with byte offset and page tag. This corresponds to the dictionary entry (shown in non-Unicode display option):
abhiruc (p. 6)
S abhiruc abhiruc, s.f. Desire, delight, pleasure, taste, relish. --- abhiruc rakhna (-men). To have or take delight (in).
The index form of each word serves as the relational operator between the object hierarchy and word object management subsystem.

word upanām 0 595 0 1 0 3 143854 0012
word upanām-vāćī-śabd 0 595 0 1 1 5 143924 0012
word a&ttod;hmanā 0 908 0 1 1 4 226849 0018
word a&ttod;kal-paććū 0 974 0 1 2 3 242980 0019
word &ttodtod;ara&htod;-se 0 1293 0 1 11 17 344921 0027
Entry forms may include any set of characters, including arbitrary SGML entities, ISCII, and Unicode representations of characters, as required.
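To make the raw index layout concrete, here is a small parsing sketch. It assumes the field order suggested by the examples above: the literal `word`, the index form, the logical address fields, then the byte offset and the page tag. The `IndexEntry` type and `parse_entry` function are names invented for this illustration.

```python
from typing import NamedTuple, Tuple


class IndexEntry(NamedTuple):
    form: str                 # index form, possibly containing SGML entities
    address: Tuple[int, ...]  # logical address, outermost object first
    byte_offset: int
    page: str                 # kept as text so "0012" is preserved


def parse_entry(line: str) -> IndexEntry:
    fields = line.split()
    if len(fields) < 5 or fields[0] != "word":
        raise ValueError("not a word index line: %r" % line)
    form = fields[1]
    # Everything between the form and the last two fields is the address.
    *address, offset, page = fields[2:]
    return IndexEntry(form, tuple(int(a) for a in address), int(offset), page)


entry = parse_entry("word Desire 0 212 0 1 1 0 62070 6")
print(entry)
```

Because the form is treated as an opaque token, entries such as `upanām-vāćī-śabd` or forms with character entities parse the same way.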

Word object management subsystem(s) serve as a way to allow the user to

word pattern expansion -- we use a two-field database containing the index form and a simplified form of each word in the database.

simplified form    index form
kalaka             kā&ltod;aka
kalyanani          kalyā&ntod;āni
susrusin           śuśrū&stod;in

Or other representations for search and display
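A sketch of word pattern expansion over such a two-field table, assuming a regular expression is matched against the simplified form and the index forms are returned. The `WORD_FORMS` list and `expand` function are illustrative; the simplified forms for the last two rows are assumptions, since only their index forms appear in this talk.

```python
import re

# (simplified form, index form) pairs.
WORD_FORMS = [
    ("kalyanani", "kalyā&ntod;āni"),
    ("susrusin", "śuśrū&stod;in"),
    ("cuhri", "ćūh&rtod;ī"),  # simplified form assumed for illustration
    ("cuhri", "ćūhrī"),       # simplified form assumed for illustration
]


def expand(pattern):
    """Return index forms whose simplified form fully matches the pattern."""
    regex = re.compile(pattern)
    return [index for simple, index in WORD_FORMS if regex.fullmatch(simple)]


matches = expand(".*cuhri.*")
print(matches)
```

The user types only unaccented ASCII; the result is a vector of index forms that can be passed on to the word index search.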

This can be logically extended by adding fields containing, for example, word lemmas and other appropriate lexical information to be used for searching, possibly including language attributes or values for objects in which words are to be distinguished but that do not have attributes (notes?).


Object Hierarchies

Object hierarchies are constructed by mapping the logical address of words found in occurrence indexes

upanām 0 595 0 1 0 3 (or 0:595:0:1:0:3)

to one or more SQL tables that have as many fields as needed (including joins from parent objects) plus the address of the object, which is always a right truncation of the address of any word it contains.
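The right-truncation rule can be sketched in a few lines. The function names `parse_address`, `containing_objects`, and `contains` are hypothetical; the only idea taken from the talk is that an object's address is a prefix of the address of every word inside it.

```python
def parse_address(text):
    """Accept either '0 595 0 1 0 3' or '0:595:0:1:0:3'."""
    return tuple(int(part) for part in text.replace(":", " ").split())


def containing_objects(word_address):
    """All right truncations of a word address, outermost object first."""
    return [word_address[:depth] for depth in range(1, len(word_address))]


def contains(object_address, word_address):
    """True if the word lies inside the object (prefix test)."""
    return word_address[:len(object_address)] == object_address


word = parse_address("0:595:0:1:0:3")
parents = containing_objects(word)
print(parents)
```

Comparing tuples of integers rather than strings avoids false matches such as `0:59` appearing to be a prefix of `0:595`.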

Object hierarchies are completely independent of types or any other information.

35:78:0:51 is a paragraph in an encyclopedia while 140:6:1:3 is a chapter (Chapitre III. La famille) of Sagnac's Législation civile....

The nested hierarchy may be displayed as a table of contents, such as document/file 140

Object types are managed in the SQL handlers rather than in word indices. This means that the object types, nesting rules, and the SGML/XML encoding used to represent this information are independent of the identified objects.

No clear technical distinction between

as all are considered simply objects that may be nested within others.

Tying it all together: relational objects

Word occurrence indexes have two relational elements
index entry of the word
logical address of the word, including all parent objects
Add a housekeeping relationship: an object to byte offset map.

Everything else logically follows from these three things.

The four step program:

1. Object Selection

colet% ./gimme "author=venel" "class=chim" | head
15:184:1
17:80:3
17:101:1
18:114:0
18:114:2
20:52:3
20:57:2
20:93:2
20:140:1
20:317:0
.....

2. Word Expansion Vector (pattern matching and/or morphological and/or additional attributes)

dsal% echo ".*cuhri.*" | ./crapser
ćūh&rtod;ā-ćūh&rtod;ī
ćūh&rtod;ī
ćūhrī
where the search string ".*cuhri.*" was applied to a simplified search field and the result is the index form of the word.

3. Word Index Searching
Combine two vectors: objects and index entry forms. For each word occurring in the database we store the occurrence indices defining its logical location in the object hierarchy, such as volume, article, subarticle, paragraph, sentence, as well as the word itself. This makes it possible to perform complex context searches, which can be delimited by any identified structure, such as sentence, paragraph, subarticle and article.

4. Context Extraction and Rendering
Only time we actually touch text. Contextualization from objects to file byte offsets. Rendering is performed by a database specific set of rules, such as conversion of SGML/XML tagging into printable HTML, rendering object navigation and internal cross-reference links, getting inline images, Unicode display and so on.
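The way the steps fit together can be sketched with toy data. Everything below is made up for illustration: the text, the index, the shortened three-level addresses (document, division, word), and the function names `search` and `context`. Steps 1 and 2 are represented only by their outputs, a set of selected object addresses and a list of index forms.

```python
# Toy database: one "file" and a hand-built word occurrence index.
TEXT = b"la lune brille. la lune est ronde."

# index form -> list of (logical address, byte offset)
INDEX = {
    "lune": [((1, 0, 1), 3), ((1, 1, 1), 19)],
    "ronde": [((1, 1, 3), 28)],
}


def search(selected_objects, index_forms):
    """Step 3: keep occurrences whose address falls inside a selected object."""
    hits = []
    for form in index_forms:
        for address, offset in INDEX.get(form, []):
            if any(address[:len(obj)] == obj for obj in selected_objects):
                hits.append((form, address, offset))
    return sorted(hits, key=lambda hit: hit[1])


def context(offset, width=8):
    """Step 4: the only place the text itself is read."""
    start = max(0, offset - width)
    return TEXT[start:offset + width].decode("utf-8")


# Step 1 output: objects chosen by attribute; step 2 output: expanded forms.
hits = search({(1, 1)}, ["lune", "ronde"])
snippets = [context(offset) for _, _, offset in hits]
print(hits)
print(snippets)
```

Only the occurrence of "lune" inside object 1:1 survives; the one in 1:0 is excluded without the text ever being scanned.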


Extending Relational Objects

Independence of the object hierarchy from the data associated with elements in the hierarchy means that SQL tables can be generated or extracted from SGML/XML as a set of wrappers, allowing multiple looks into a database.

Supports handling objects of different granularity and definition.

Chadwyck-Healey's English Poetry:

with metadata/description of each which can be searched separately or in conjunction.

Consider the SGML representation of Schiller's letters as found in his complete works.

The metadata describes the digital volumes. Not a completely useful description of letters.

<letter author="Schiller, Friedrich von" gender=m n=1 id=23L1 tocdate="21. April 1772" date=17720421 date2=17720421 address1="Stoll, Elisabetha Margaretha, geb. Sommer" somhead="Band 23: Brief an Elisabetha Margaretha Stoll, 21. April 1772, (Nr. 1)" somlevel=1 comhead="Bd. 23: Briefwechsel: Brief an Stoll, Elisabetha Margaretha, geb. Sommer, 21. April 1772 (Nr. 1)">

Multiple descriptions are automatically generated by relational joins.

Practical consideration: I like to precompute joins for speed.
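A sketch of what precomputing a join might look like, using Python's built-in sqlite3 module. The table and column names (`volumes`, `letters`, `letters_flat`) are assumptions for this example; the values are taken from the letter tag above, with the volume label read from its `comhead` attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE volumes (vol INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE letters (
    id TEXT PRIMARY KEY, vol INTEGER, author TEXT,
    date INTEGER, addressee TEXT);

INSERT INTO volumes VALUES (23, 'Bd. 23: Briefwechsel');
INSERT INTO letters VALUES (
    '23L1', 23, 'Schiller, Friedrich von', 17720421,
    'Stoll, Elisabetha Margaretha, geb. Sommer');

-- Precompute the join once at load time, so that searches read a
-- single flat table instead of joining on every query.
CREATE TABLE letters_flat AS
    SELECT l.id, l.author, l.date, l.addressee, v.label AS volume_label
    FROM letters AS l JOIN volumes AS v ON v.vol = l.vol;
""")

row = conn.execute(
    "SELECT id, volume_label FROM letters_flat"
    " WHERE date BETWEEN 17720101 AND 17721231"
).fetchone()
print(row)
```

The trade-off is the usual one: the flat table must be rebuilt when either source table changes, in exchange for simpler and faster lookups at search time.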

No theoretical limitation to number or size of tables that can be used to describe objects.


Extractors

This is an evolving model, with small development steps, which has allowed us to work toward a more general production system.

Currently, the database load process is a single, essentially hard-coded data extractor, which identifies metadata, words, and selected objects corresponding to the ARTFL Text Encoding specification.

This builds the indices, objects, and general metadata for a wide variety of databases.

Additional descriptive object data and extensions to metadata are compiled by one or more extractors, which build SQL tables corresponding to the object hierarchy.

In theory, there are a couple of advantages:

And at least two downsides:


SQL-SGML/XML linker

Object Model poses the operational question: what should be encoded as SGML/XML and when?

Objects in this model can have many attributes. Alexander Street Press databases feature extensive object attributes, some well over 100 fields per object.

<sp>
<speaker n="CH00166">STAMPFIELD</speaker>
Hello. Mose, what are you kicking about?
</sp>

CH00166 is related to an entry in a character database that has many fields, some of which are individual title specific and others which apply to characters across the entire collection of 1,000 plays (gender, occupation, nationality, sexual orientation) relationally joined to data pertaining to parent objects (acts, scenes, plays, etc.).
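A sketch of the link between the encoded text and the character database, assuming the `n` attribute of `<speaker>` is the join key. It uses Python's standard xml.etree.ElementTree parser; the `CHARACTERS` dictionary stands in for the SQL table, and its field value is a placeholder because the actual attributes of this character are not given here.

```python
import xml.etree.ElementTree as ET

SPEECH = """<sp>
<speaker n="CH00166">STAMPFIELD</speaker>
Hello. Mose, what are you kicking about?
</sp>"""

# Stand-in for the character table; the field value is a placeholder.
CHARACTERS = {
    "CH00166": {"name": "STAMPFIELD", "occupation": "(placeholder)"},
}


def speeches_with_attributes(xml_text):
    """Yield (character id, character attributes, speech text)."""
    root = ET.fromstring(xml_text)
    for sp in root.iter("sp"):
        speaker = sp.find("speaker")
        char_id = speaker.get("n")
        # The spoken text follows the <speaker> element as its tail.
        text = (speaker.tail or "").strip()
        yield char_id, CHARACTERS.get(char_id, {}), text


linked = list(speeches_with_attributes(SPEECH))
print(linked)
```

The markup carries only the identifier; everything else about the character lives in the table and can change without touching the text.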

The architecture permits searching for all of the character dialogues by "black, female, heterosexual, teachers, in dramas published between 1980 and 2000", such as "Karen Swanson" in Rita Dove's The Siberian Village and "Rosa Steele" in Afaa Weaver's Rosa.

Operationally, it is far easier to maintain such associations directly rather than to encode this data in SGML/XML.

For export purposes, this could easily be done.

Assuming the relational integration of SGML/XML and SQL tables from database design and data capture onward simplifies the management of complex encoding projects.


Multiple Object Hierarchies

Most documents can be represented with one main object hierarchy -- book, chapter, verse -- and pagination as separate flat parallel objects.

PhiloLogic: pagination is not a real object but a secondary set of blocks used for contextualization from searches and page-turning.

Experiment: Secondary objects can be extended by relating one or more SQL tables to blocks of text. Implemented as an SQL wrapper to provide an alternative means of searching descriptive material.

Example: Text glosses and extensive indexing implemented as a large (95 field) table pointing to arbitrary blocks in text:

<part id="S2578-D002">
[some text, divs, etc]
</part>

SQL table contains indexing for flora, fauna, encounters, tribes, places, persons, etc. as well as housekeeping items like sequencing and cross language linking.

Mapping to and from main object hierarchy.

Useful because much of the indexing is too specific to be applied to the typical divisions found in these documents.


Multiple Views and Threads

The object model provides the capability to consider a large textual database from different views, e.g. by searching for joined attributes across the entire database without regard to object depth.

SQL-driven object management also supports multiple threading of objects across the database, e.g.

Sequencing data can be added dynamically AND be used as search criteria.

Extensive Attributes: Example

Multiple tables describing contents. Early Encounters in North America has 7 tables, combined in various ways (Table). A couple of examples of SQL tables associated with SGML text

Extensive Attributes: Implications

Object environment supports many related tables which can cover a wide variety of different types of contextual data.

Black Drama - 1850 to the Present (BLDR), for example, is currently built around the full-text of 200 plays by 60 authors (with a final projected size of some 1,200 plays). In addition to the full-text, BLDR has 8 metadata tables containing information describing:

The tables are generated by multiple joins from internal databases to incorporate data that is deemed to be useful in describing a particular feature, including elements that might be considered to be traditional metadata as well as data that is extracted from text data or associated with textual data.

Four? General categories

Extensive description of textual data can be integrated with any particular object by a series of (typically) precomputed joins.

Object vs word attributes

Implementation design question: should all objects be managed in SQL tables?

Many objects do not have significant attributes: notes, stage directions, etc.

For text searching we may want to be able to distinguish items in this class of objects.

Under discussion at ARTFL: a heuristic that would add a field in the word occurrence index to identify the context of words that occur in "non-attribute value" objects.

These are attributes, of course, that can be associated with a word. A stage direction is a real thing, it simply does not have any additional values typically associated with it.

"Non-attribute value" objects would appear in the standard object hierarchy and could be distinguished for display.

Currently an open question.


TEIification?

Interesting fact: full text systems rarely actually process text.

On load: SGML/XML is parsed to extract raw data for word occurrence indexes and SQL table generation.

In real time: processing text is only for formatting, linking to external resources.

Both functions are in highly isolated subsystems.

Retrofitting PhiloLogic to handle TEI-XML would, from my examination, require:

Current production implementation has a fixed object depth. Development implementation has flexible object depth capabilities (and a NOT operator).

Principle: All internal functions of PhiloLogic are text encoding independent and may also be character encoding independent (ISO-Latin, Unicode, character entities).


Confusing Metadata and Annotation

Comment on original paper:
[...] there is a fundamental problem in that the paper is more about annotation than metadata. Much of the community is confused about what metadata really is, and this paper will only add to that confusion. The example given [...] is clearly an instance of an annotation of the text, not metadata. In TEI terms, metadata is the description of the resource as a whole that goes in the TEI Header. The example given is an instance of what the TEI calls "the analysis and interpretation of text".
The distinction on an encoding level is perfectly valid.

In a dynamic object implementation model, it appears that the distinction becomes hopelessly blurred.

Further: it appears to me there is a working assumption that SGML/XML will represent documents as files and that these will be the basic processing unit.

Object model: no necessary relationship between files, documents, and objects. All dynamically defined.

Insistence on the distinction, which "much of the community is confused about", may be restricting the way we think about how to implement systems to handle TEI-encoded databases.


Truth in Advertising

Model for an implementation rather than a piece of software.

Could certainly leverage new generation XML tools for many aspects.

Currently using XSLT at ARTFL. Need to examine XPath, etc.

PhiloLogic: ongoing development. It all works but not in a single coherent implementation. Various components performing various functions implemented in a number of different projects.

Working now to tie it all together with commercial and possibly academic collaborators.

Jelloware? Much more than a theory or design, less than a fully articulated system.


Conclusion

The large-scale implementation problems posed by the nearly limitless expressive power of the TEI are subject to a number of different partial and full solutions.

The integration of SQL and full text in an object model provides one partial, relatively low cost, scalable, and robust approach to the problem.

I fully expect that other solutions, using the array of new XML-aware tools being created, will help guide our own modernization of PhiloLogic.

I am hoping that the text object model outlined here will be of benefit to others working to leverage the current and next generation of richly encoded documents being created by so many academic and commercial organizations.

