Introduction
Wrong Place? Wrong Time? Or just plain wrong?
The expressive power of the Text Encoding Initiative Guidelines poses significant implementation challenges for large database systems.
Groping my way to a possible general solution that can be implemented on large databases.
Text object model using related SQL tables to manage object hierarchies in conjunction with a full text indexing system that is also aware of object hierarchies for searching and retrieval.
This talk:
Caveat emptor: it all works, in one version or another, but it is not yet a coherent system.
Talk Home | Olsen: Words, Objects and Attributes: Slide 1 | Next Slide |
Problematic
Ability to define almost any type of textual object in almost any hierarchical configuration with almost any set of attributes.
Expression of TEI in XML: supports wide array of XML tools for document management, navigation, display.
Not clear that these tools will scale to handle large databases. Existing systems are expensive and not aimed at academic environment.
Object model (in PhiloLogic): handles most problems so far encountered, and may be logically extended as a model to handle more.
SQL and Text Objects
Textual objects may range from the individual word to databases containing many collections.
Object management is handled by one or more dedicated subsystems configured to use the full power of SQL, including arithmetic, logical, and pattern matching operations.
Marriage of full text and SQL has the advantage of using two distinct components that are robust, fast, and scalable.
PhiloLogic word --> object indexing scheme has been used extensively in many document types and language groups.
Text = Objects and Attributes
Text is typically considered as a tree, but one may express more than one hierarchy of objects, e.g. "book-chapter-verse" and "pages".
So far, words are the lowest (atomic) level of object.
A word index is then a logical address of nested objects to the word object, e.g.:
Volume--> Book--> Chapter--> SubChapter--> paragraph--> sentence--> phrase--> word
For indexing purposes, words are simply given logical addresses without reference to object type or any other data.
Implied object divisions: calculate phrases, sentences, etc. from context if explicit tagging is unavailable.
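A minimal sketch of the addressing idea, assuming a logical address can be modeled as a plain tuple of integers with one position per nesting level. The level order and the helper names `parent` and `same_object` are illustrative assumptions, not PhiloLogic internals.

```python
from typing import Tuple

# A logical address is a tuple of integers, one per nesting level.
# It carries no object type or other data, only position in the hierarchy.
Address = Tuple[int, ...]


def parent(address: Address, depth: int) -> Address:
    """Return the enclosing object obtained by keeping the first `depth` levels."""
    return address[:depth]


def same_object(a: Address, b: Address, depth: int) -> bool:
    """True if two word addresses share the same enclosing object at `depth`."""
    return parent(a, depth) == parent(b, depth)


# Two hypothetical word addresses, ordered as:
# volume, book, chapter, subchapter, paragraph, sentence, phrase, word
w1 = (1, 2, 3, 0, 4, 1, 0, 5)
w2 = (1, 2, 3, 0, 4, 2, 1, 0)

print(same_object(w1, w2, 5))  # compared at paragraph level
print(same_object(w1, w2, 6))  # compared at sentence level
```

Because the comparison is a prefix test, a co-occurrence search delimited by any level of the hierarchy needs no knowledge of what kind of object sits at that level.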
Text Objects in Philo
These can be used to
Descriptive data associated with this object is stored in an SQL table
Object attributes are stored in SQL tables and can have different values at different levels.
Headword | Type | Author | Class of Knowledge | P.S. | Vol:Obj | Obj ID |
LUNE | artm | d'Alembert | Astr. | s.f. | 9:3472 | 70:12:0 |
Lune | arts | XXX | Chimie. | NA | 9:3473 | 70:12:1 |
Lune | arts | Venel | Chimie. | NA | 9:3474 | 70:12:2 |
Lune | arts | XXX | Hist. nat. Chimie, Metallurgie a Mineralogie. | NA | 9:3475 | 70:12:3 |
Lune cornée | arts | XXX | Chimie Metall. | NA | 9:3476 | 70:12:4 |
Lune | arts | Jaucourt | Mythologie. | NA | 9:3477 | 70:12:5 |
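The table above can be loaded into SQL and queried by attribute. The sketch below uses SQLite purely for illustration; the table name `objects`, the column names, and the helper `select_objects` are assumptions, while the rows are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are assumed; they mirror the headings of the table above.
conn.execute(
    """CREATE TABLE objects (
           headword TEXT, type TEXT, author TEXT,
           knowledge_class TEXT, pos TEXT, vol_obj TEXT, obj_id TEXT)"""
)
rows = [
    ("LUNE", "artm", "d'Alembert", "Astr.", "s.f.", "9:3472", "70:12:0"),
    ("Lune", "arts", "XXX", "Chimie.", "NA", "9:3473", "70:12:1"),
    ("Lune", "arts", "Venel", "Chimie.", "NA", "9:3474", "70:12:2"),
    ("Lune", "arts", "XXX", "Hist. nat. Chimie, Metallurgie a Mineralogie.",
     "NA", "9:3475", "70:12:3"),
    ("Lune cornée", "arts", "XXX", "Chimie Metall.", "NA", "9:3476", "70:12:4"),
    ("Lune", "arts", "Jaucourt", "Mythologie.", "NA", "9:3477", "70:12:5"),
]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def select_objects(author_pattern: str, class_pattern: str) -> list:
    """Return the object IDs whose author and class match the LIKE patterns."""
    cur = conn.execute(
        "SELECT obj_id FROM objects "
        "WHERE author LIKE ? AND knowledge_class LIKE ? ORDER BY obj_id",
        (author_pattern, class_pattern),
    )
    return [row[0] for row in cur]


print(select_objects("%venel%", "%chim%"))
```

SQLite's `LIKE` is case-insensitive for ASCII letters by default, so the lowercase patterns still match "Venel" and "Chimie."; other SQL engines may need an explicit case fold.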
Word Objects in PhiloLogic
Internal raw index (from Platts' A Dictionary of Urdu, Classical Hindi, and English):
word Desire 0 212 0 1 1 0 62070 6
word delight 0 212 0 1 1 1 62078 6
word pleasure 0 212 0 1 1 2 62087 6
word taste 0 212 0 1 1 3 62097 6
word relish 0 212 0 1 1 4 62104 6
showing the object hierarchy to the word, with byte offset and page tag. This corresponds to the dictionary entry (shown in non-Unicode display option):
abhiruc (p. 6)
Sabhiruc abhiruc, s.f. Desire, delight, pleasure, taste, relish. --- abhiruc rakhna (-men). To have or take delight (in).
The index form of each word serves as the relational operator between the object hierarchy and word object management subsystem.
word upanām 0 595 0 1 0 3 143854 0012
word upanām-vāćī-śabd 0 595 0 1 1 5 143924 0012
word a&ttod;hmanā 0 908 0 1 1 4 226849 0018
word a&ttod;kal-paććū 0 974 0 1 2 3 242980 0019
word &ttodtod;ara&htod;-se 0 1293 0 1 11 17 344921 0027
Entry forms may include any set of characters, including arbitrary SGML entities, ISCII, and Unicode representations of characters, as required.
Word object management subsystem(s) serve as a way to allow the user to
kalaka | kā&ltod;aka |
kalyanani | kalyā&ntod;āni |
susrusin | śuśrū&stod;in |
Or other representations for search and display
This can be logically extended by adding fields containing, for example, word lemmas and other appropriate lexical information to be used for searching, possibly including language attributes or values for objects in which words are to be distinguished but that do not have attributes (notes?).
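As an illustration of how a raw index line might be read, here is a small parser. It assumes, from the description above, that the last two fields are the byte offset and the page tag and that every integer between the word form and those two fields belongs to the object hierarchy; the names `IndexEntry` and `parse_index_line` are hypothetical.

```python
from typing import NamedTuple, Tuple


class IndexEntry(NamedTuple):
    form: str                 # index form of the word, entities left untouched
    address: Tuple[int, ...]  # logical address in the object hierarchy
    byte_offset: int          # position of the word in the source file
    page: str                 # page tag, kept as text to preserve "0012"


def parse_index_line(line: str) -> IndexEntry:
    """Split one raw index line into form, address, byte offset, and page tag."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "word":
        raise ValueError(f"not a word index line: {line!r}")
    form = fields[1]
    # Assumed layout: address integers, then byte offset, then page tag.
    *address, offset, page = fields[2:]
    return IndexEntry(form, tuple(int(n) for n in address), int(offset), page)


entry = parse_index_line("word Desire 0 212 0 1 1 0 62070 6")
print(entry.form, entry.address, entry.byte_offset, entry.page)
```

Keeping the page tag as a string is a deliberate choice here, since values such as `0012` in the entries above would lose their leading zeros if converted to integers.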
Object Hierarchies
Object hierarchies are completely independent of types or any other information.
35:78:0:51 is a paragraph in an encyclopedia while 140:6:1:3 is a chapter (Chapitre III. La famille) of Sagnac's Législation civile....
The nested hierarchy may be displayed as a table of contents, such as document/file 140
Object types are managed in the SQL handlers rather than in word indices. This means that the object types, nesting rules, and the SGML/XML encoding used to represent this information are independent of the identified objects.
No clear technical distinction between
Tying it all together: relational objects
index entry of the word
logical address of the word, including all parent objects
Add a housekeeping relationship: an object to byte offset map.
Everything else logically follows from these three things.
The four step program:
1. Object Selection
colet% ./gimme "author=venel" "class=chim" | head
15:184:1
17:80:3
17:101:1
18:114:0
18:114:2
20:52:3
20:57:2
20:93:2
20:140:1
20:317:0
.....
2. Word Expansion Vector (pattern matching and/or morphological and/or
additional attributes)
dsal% echo ".*cuhri.*" | ./crapser
ćūh&rtod;ā-ćūh&rtod;ī
ćūh&rtod;ī
ćūhrī
where the search string ".*cuhri.*" was applied to a simplified search field and the result is the index form of the word.
3. Word Index Searching
Combine two vectors: objects and index entry forms.
For each word occurring in the database we store the
occurrence indices defining its logical location in the object
hierarchy, such as volume, article, subarticle, paragraph, sentence, as
well as the word itself. This makes it possible to perform complex
context searches, which can be delimited by any identified structure,
such as sentence, paragraph, subarticle and article.
4. Context Extraction and Rendering
Only time we actually touch text. Contextualization from objects
to file byte offsets. Rendering is performed by a database specific
set of rules, such as conversion of SGML/XML tagging into printable
HTML, rendering object navigation and internal cross-reference links,
getting inline images, Unicode display and so on.
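The four steps can be sketched end to end on a toy corpus. Everything below is an illustrative assumption rather than PhiloLogic code: the three-object corpus, the single `author` attribute, the function names, and the use of Python dictionaries in place of SQL tables and on-disk indices.

```python
import re
from typing import Dict, List, Set, Tuple

Address = Tuple[int, ...]

# Toy corpus: (object address, author, text). Invented for illustration.
OBJECTS = [
    ((1, 0), "d'Alembert", "la lune brille"),
    ((1, 1), "Venel", "le sel de lune"),
    ((2, 0), "Venel", "la lune et le sel"),
]


def load(objects):
    """Build the raw text, the attribute table, and the word index in one pass."""
    raw = b""
    attrs: Dict[Address, Dict[str, str]] = {}
    index: Dict[str, List[Tuple[Address, int]]] = {}
    for address, author, text in objects:
        attrs[address] = {"author": author}
        data = text.encode("utf-8")
        for position, match in enumerate(re.finditer(rb"\S+", data)):
            form = match.group().decode("utf-8")
            # Word address = object address + word position; offset is in bytes.
            index.setdefault(form, []).append(
                (address + (position,), len(raw) + match.start()))
        raw += data + b"\n"
    return raw, attrs, index


def select_objects(attrs, author: str) -> Set[Address]:
    """Step 1: object selection by attribute."""
    return {a for a, row in attrs.items()
            if author.lower() in row["author"].lower()}


def expand_words(index, pattern: str) -> List[str]:
    """Step 2: expand a pattern into the index forms it matches."""
    regex = re.compile(pattern)
    return sorted(form for form in index if regex.fullmatch(form))


def search(index, objects: Set[Address], forms: List[str], depth: int):
    """Step 3: combine the object vector and the word vector."""
    hits = [(address, offset)
            for form in forms
            for address, offset in index[form]
            if address[:depth] in objects]
    return sorted(hits)


def context(raw: bytes, offset: int, width: int = 10) -> str:
    """Step 4: the only step that touches the text itself."""
    start = max(0, offset - width)
    return raw[start:offset + width].decode("utf-8", errors="replace")


raw, attrs, index = load(OBJECTS)
selected = select_objects(attrs, "venel")
forms = expand_words(index, "lun.*")
hits = search(index, selected, forms, depth=2)
for address, offset in hits:
    print(address, repr(context(raw, offset)))
```

Steps 1 to 3 work only with addresses and index forms; the text is read once, at the end, through the byte offsets recorded at load time.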
Extending Relational Objects
Supports handling objects of different granularity and definition.
Chadwyck-Healey's English Poetry:
Consider the SGML representation of Schiller's letters as found in his complete works.
The metadata describes the digital volumes. Not a completely useful description of letters.
<letter author="Schiller, Friedrich von" gender=m n=1 id=23L1 tocdate="21. April 1772" date=17720421 date2=17720421 address1="Stoll, Elisabetha Margaretha, geb. Sommer" somhead="Band 23: Brief an Elisabetha Margaretha Stoll, 21. April 1772, (Nr. 1)" somlevel=1 comhead="Bd. 23: Briefwechsel: Brief an Stoll, Elisabetha Margaretha, geb. Sommer, 21. April 1772 (Nr. 1)">
Multiple descriptions are automatically generated by relational joins.
Practical consideration: I like to precompute joins for speed.
No theoretical limitation to number or size of tables that can be used to describe objects.
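A sketch of a precomputed join, using SQLite for illustration. The table and column names are assumptions; the single letter row takes its values from the `<letter>` tag above, and the volume title is borrowed from its `comhead` attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table describes volumes, another describes letters (names assumed).
    CREATE TABLE volumes (volume INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE letters (
        obj_id TEXT PRIMARY KEY, volume INTEGER,
        author TEXT, addressee TEXT, date TEXT);

    INSERT INTO volumes VALUES (23, 'Bd. 23: Briefwechsel');
    INSERT INTO letters VALUES (
        '23L1', 23, 'Schiller, Friedrich von',
        'Stoll, Elisabetha Margaretha, geb. Sommer', '17720421');

    -- Precompute the join once at load time so queries read a single table.
    CREATE TABLE letter_view AS
        SELECT l.obj_id, l.author, l.addressee, l.date,
               v.title AS volume_title
        FROM letters AS l
        JOIN volumes AS v ON v.volume = l.volume;
    """
)

row = conn.execute(
    "SELECT obj_id, addressee, volume_title FROM letter_view "
    "WHERE date BETWEEN '17720101' AND '17721231'"
).fetchone()
print(row)
```

Dates in the `YYYYMMDD` form used by the `date` attribute sort correctly as text, which is why the range test works on a TEXT column. The trade-off of materializing the join is that the joined table must be rebuilt whenever either source table changes.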
Extractors
Currently, the database load process is a single, basically hard-coded data extractor, which identifies metadata, words, and selected objects corresponding to the ARTFL Text Encoding specification.
This builds the indices, objects, and general metadata for a wide variety of databases.
Additional descriptive object data and extensions to metadata are compiled by one or more extractors, which build SQL tables corresponding to the object hierarchy.
In theory, there are a couple of advantages:
And at least two downsides:
SQL-SGML/XML linker
Objects in this model can have many attributes. Alexander Street Press databases feature extensive object attributes, some well over 100 fields per object.
<sp>
<speaker n="CH00166">STAMPFIELD</speaker>
Hello. Mose, what are you kicking about?
</sp>
CH00166 is related to an entry in a character database that has many fields, some of which are individual title specific and others which apply to characters across the entire collection of 1,000 plays (gender, occupation, nationality, sexual orientation) relationally joined to data pertaining to parent objects (acts, scenes, plays, etc.).
The architecture permits searching for all of the character dialogues by "black, female, heterosexual, teachers, in dramas published between 1980 and 2000", such as "Karen Swanson" in Rita Dove's The Siberian Village and "Rosa Steele" in Afaa Weaver's Rosa.
Operationally, it is far easier to maintain such associations directly rather than to encode this data in SGML/XML.
For export purposes, this could easily be done.
Assuming the relational integration of SGML/XML and SQL tables from database design and data capture onward simplifies the management of complex encoding projects.
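A minimal sketch of the link between the encoded speech and the character table: the `n` attribute of `<speaker>` is used as a key into SQL. The `characters` table, its columns, and the attribute values stored in it are invented for illustration; only the `<sp>` fragment and the identifier `CH00166` come from the example above.

```python
import sqlite3
import xml.etree.ElementTree as ET

SP = """<sp>
<speaker n="CH00166">STAMPFIELD</speaker>
Hello. Mose, what are you kicking about?
</sp>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE characters (char_id TEXT PRIMARY KEY, gender TEXT, occupation TEXT)"
)
# Placeholder attribute values, NOT taken from the real character database.
conn.execute("INSERT INTO characters VALUES ('CH00166', 'male', 'laborer')")


def describe_speech(sp_xml: str) -> dict:
    """Join one <sp> element to its character record through speaker/@n."""
    sp = ET.fromstring(sp_xml)
    speaker = sp.find("speaker")
    char_id = speaker.get("n")
    row = conn.execute(
        "SELECT gender, occupation FROM characters WHERE char_id = ?",
        (char_id,),
    ).fetchone()
    return {
        "char_id": char_id,
        "speaker": speaker.text,
        # The speech is the text that follows the <speaker> element.
        "speech": (speaker.tail or "").strip(),
        "gender": row[0] if row else None,
        "occupation": row[1] if row else None,
    }


info = describe_speech(SP)
print(info["speaker"], info["char_id"], info["speech"])
```

The markup carries only the identifier; every descriptive field lives in the table, so attributes can be added or corrected without touching the encoded text.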
Multiple Object Hierarchies
PhiloLogic: pagination is not a real object but a secondary set of blocks used for contextualization from searches and page-turning.
Experiment: Secondary objects can be extended by relating one or more SQL tables to blocks of text. Implemented as an SQL wrapper to provide an alternative means of searching descriptive material.
Example: Text glosses and extensive indexing implemented as a large (95 field) table pointing to arbitrary blocks in text:
<part id="S2578-D002"> [some text, divs, etc] </part>
SQL table contains indexing for flora, fauna, encounters, tribes, places, persons, etc. as well as housekeeping items like sequencing and cross language linking.
Mapping to and from main object hierarchy.
Useful because much of the indexing is too specific to be applied to the typical divisions found in these documents.
Multiple Views and Threads
SQL-driven object management also supports multiple threading of objects across the database, e.g.
Extensive Attributes: Example
Extensive Attributes: Implications
Black Drama - 1850 to the Present (BLDR), for example, is currently built around the full-text of 200 plays by 60 authors (with a final projected size of some 1,200 plays). In addition to the full-text, BLDR has 8 metadata tables containing information describing:
Four? General categories
Object vs word attributes
Many objects do not have significant attributes: notes, stage directions, etc.
For text searching we may want to be able to distinguish items in this class of objects.
Under discussion at ARTFL: a heuristic that would add a field to the word occurrence index to identify the context of words that occur in "non-attribute value" objects.
These are attributes, of course, that can be associated with a word. A stage direction is a real thing, it simply does not have any additional values typically associated with it.
"Non-attribute value" objects would appear in the standard object hierarchy and could be distinguished for display.
Currently an open question.
TEIification?
On load: SGML/XML is parsed to extract raw data for word occurrence indexes and SQL table generation.
In real time: processing text is only for formatting, linking to external resources.
Both functions are in highly isolated subsystems.
Retrofitting PhiloLogic to handle TEI-XML would, from my examination, require:
Principle: All internal functions of PhiloLogic are text encoding independent and may also be character encoding independent (ISO-Latin, Unicode, character entities).
Confusing Metadata and Annotation
[...] there is a fundamental problem in that the paper is more about annotation than metadata. Much of the community is confused about what metadata really is, and this paper will only add to that confusion. The example given [...] is clearly an instance of an annotation of the text, not metadata. In TEI terms, metadata is the description of the resource as a whole that goes in the TEI Header. The example given is an instance of what the TEI calls "the analysis and interpretation of text".
The distinction on an encoding level is perfectly valid.
In a dynamic object implementation model, it appears that the distinction becomes hopelessly blurred.
Further: it appears to me there is a working assumption that SGML/XML will represent documents as files and that these will be the basic processing unit.
Object model: no necessary relationship between files, documents, and objects. All dynamically defined.
Insistence on the distinction, which "much of the community is confused about" may be restricting the way we think about how to implement systems to handle TEI encoded databases.
Truth in Advertising
Could certainly leverage new generation XML tools for many aspects.
Currently using XSLT at ARTFL. Need to examine XPath, etc.
PhiloLogic: ongoing development. It all works but not in a single coherent implementation. Various components performing various functions implemented in a number of different projects.
Working now to tie it all together with commercial and possibly academic collaborators.
Jelloware? Much more than a theory or design, less than a fully articulated system.
Conclusion
The integration of SQL and full text in an object model provides one partial, relatively low cost, scalable, and robust approach to the problem.
I fully expect that other solutions, using the array of new XML-aware tools being created, will emerge and may guide our own modernization of PhiloLogic.
I am hoping that the text object model outlined here will be of benefit to others working to leverage the current and next generation of richly encoded documents being created by so many academic and commercial organizations.