TEI stylesheet for converting Word docx files to TEI
This software is dual-licensed:
1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
Unported License http://creativecommons.org/licenses/by-sa/3.0/
2. http://www.opensource.org/licenses/BSD-2-Clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
This software is provided by the copyright holders and contributors
"as is" and any express or implied warranties, including, but not
limited to, the implied warranties of merchantability and fitness for
a particular purpose are disclaimed. In no event shall the copyright
holder or contributors be liable for any direct, indirect, incidental,
special, exemplary, or consequential damages (including, but not
limited to, procurement of substitute goods or services; loss of use,
data, or profits; or business interruption) however caused and on any
theory of liability, whether in contract, strict liability, or tort
(including negligence or otherwise) arising in any way out of the use
of this software, even if advised of the possibility of such damage.
The main template that starts the conversion from docx to TEI
IMPORTING STYLESHEETS AND OVERRIDING MATCHED TEMPLATES:
When a stylesheet is imported (xsl:import), every template in the
imported stylesheet gets a lower import precedence than the
templates in the importing stylesheet. This means the importing
stylesheet cannot override only a general template, for example
the one that matches all <w:p> elements for which no more
specialized rule applies: a new w:p template would automatically
override every w:p[somepredicate] template in the imported
stylesheet as well. To avoid this, the processing of the general
template has been moved into a named template, and all the
matching template in the imported stylesheet does is call that
named template. The importing stylesheet can then override just
the named template and leave the specialized rules intact.
See templates: - w:p (mode: paragraph)
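The override pattern described above can be sketched from the importing side as follows. This is only an illustration under stated assumptions: the file name `base-docx.xsl` and the named template `paragraph-general` are hypothetical stand-ins, and the imported stylesheet is assumed to contain a `w:p` template in mode `paragraph` that does nothing but call that named template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="w">

  <!-- Hypothetical imported stylesheet. It is assumed to contain:
         a template matching w:p in mode "paragraph" whose only
         content is a call to the named template "paragraph-general",
         plus any number of w:p[somepredicate] templates. -->
  <xsl:import href="base-docx.xsl"/>

  <!-- Override only the named template. The specialized
       w:p[somepredicate] templates in the imported stylesheet are
       untouched, because no matching w:p template is declared here. -->
  <xsl:template name="paragraph-general">
    <p rend="custom">
      <xsl:apply-templates/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Because named templates are also subject to import precedence, the definition in the importing stylesheet wins whenever the imported `w:p` template calls it.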
Modes:
pass0: a normalization process for styles. Can also
detect illegal styles.
pass2: templates that apply in the second stage
of the conversion, cleaning up TEI elements created in the
first pass.
inSectionGroup: defines a template that works on a
group of consecutive elements (w:p or w:tbl elements) that
form a section (a normal section, not to be confused with
w:sectPr).
paragraph: defines a template that
works on an individual element (usually
starting with a w:p element).
<xsl:template match="/">
  <!-- Stop early if the relationships file or the style document cannot be read -->
  <xsl:if test="not(doc-available($relsFile))">
    <xsl:message terminate="yes">The file <xsl:value-of select="$relsFile"/> cannot be read</xsl:message>
  </xsl:if>
  <xsl:if test="not(doc-available($styleDoc))">
    <xsl:message terminate="yes">The file <xsl:value-of select="$styleDoc"/> cannot be read</xsl:message>
  </xsl:if>
  <!-- Do an initial normalization and store everything in $pass0 -->
  <xsl:variable name="pass0">
    <xsl:apply-templates mode="pass0"/>
  </xsl:variable>
  <!-- Do the main transformation and store everything in $pass1 -->
  <xsl:variable name="pass1">
    <xsl:for-each select="$pass0">
      <xsl:apply-templates/>
    </xsl:for-each>
  </xsl:variable>
  <!-- This pass simply gets rid of empty <tei:hi>s to avoid unwanted
       processing in pass2. If similar adjustments become necessary,
       we suggest adding them to this step. -->
  <xsl:variable name="pass1hi">
    <xsl:for-each select="$pass1">
      <xsl:apply-templates mode="pass1hi"/>
    </xsl:for-each>
  </xsl:variable>
  <!-- Do the final pass and create valid TEI -->
  <xsl:apply-templates select="$pass1hi" mode="pass2"/>
  <xsl:call-template name="fromDocxFinalHook"/>
</xsl:template>
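The templates for mode `pass1hi` are not part of this section. As a rough sketch of what such a clean-up pass could look like, the stylesheet below copies everything unchanged and drops `tei:hi` elements that have no child nodes. Treating "empty" as "no child nodes at all" is an assumption made for the example; the actual stylesheet may use a different test.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Entry point for running the sketch on its own -->
  <xsl:template match="/">
    <xsl:apply-templates mode="pass1hi"/>
  </xsl:template>

  <!-- Identity rule: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()" mode="pass1hi">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="pass1hi"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: "empty" means no child nodes of any kind.
       An empty template body removes the element from the output. -->
  <xsl:template match="tei:hi[not(node())]" mode="pass1hi"/>

</xsl:stylesheet>
```

The pattern with a predicate has a higher default priority than `node()`, so the removal rule wins over the identity rule without an explicit `priority` attribute.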
<xsl:template match="w:document">
  <TEI>
    <!-- create teiHeader -->
    <xsl:call-template name="create-tei-header"/>
    <!-- convert main and back matter -->
    <xsl:apply-templates select="w:body"/>
  </TEI>
</xsl:template>
<xsl:template match="w:body">
  <text>
    <!-- Create forme work -->
    <xsl:call-template name="extract-forme-work"/>
    <!-- create TEI body -->
    <body>
      <xsl:call-template name="mainProcess"/>
    </body>
  </text>
</xsl:template>
<xsl:template name="mainProcess">
  <xsl:param name="extrarow" tunnel="yes"/>
  <xsl:param name="extracolumn" tunnel="yes"/>
  <!-- group all paragraphs that form a first level section. -->
  <xsl:for-each-group select="w:sdt|w:p|w:tbl" group-starting-with="w:p[tei:isFirstlevel-heading(.)]">
    <xsl:choose>
      <!-- We are dealing with a first level section; we now have
           to further divide the section into subsections that we can then
           finally work on -->
      <xsl:when test="tei:is-heading(.)">
        <xsl:call-template name="group-by-section"/>
      </xsl:when>
      <xsl:when test="tei:is-front(.)">
        <front>
          <xsl:apply-templates select="." mode="inSectionGroup"/>
        </front>
      </xsl:when>
      <!-- We have found some loose paragraphs. These are most probably
           front matter paragraphs. We can simply convert them without further
           trying to split them up into subsections. -->
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="inSectionGroup"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
  <!-- I have no idea why I need this, but I apparently do.
       //TODO: find out what is going on -->
  <xsl:apply-templates select="w:sectPr" mode="paragraph"/>
</xsl:template>
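The first-level grouping in mainProcess relies on `xsl:for-each-group` with `group-starting-with`. The sketch below shows the same technique on a deliberately simplified, hypothetical input: a flat `<doc>` whose `<head>` and `<p>` children stand in for heading and non-heading `w:p` elements. The element names and the `<div>` wrapper are illustrative assumptions, not what the real stylesheet emits.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <xsl:template match="/doc">
    <body>
      <!-- Every head starts a new group; anything before the
           first head forms a leading group of its own. -->
      <xsl:for-each-group select="*" group-starting-with="head">
        <xsl:choose>
          <!-- The context item is the first item of the group -->
          <xsl:when test="self::head">
            <div>
              <head><xsl:value-of select="."/></head>
              <xsl:apply-templates select="current-group() except ."/>
            </div>
          </xsl:when>
          <!-- Loose content that precedes the first head -->
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </body>
  </xsl:template>

  <xsl:template match="p">
    <p><xsl:value-of select="."/></p>
  </xsl:template>

</xsl:stylesheet>
```

Given an input such as `<doc><p>intro</p><head>One</head><p>a</p><head>Two</head><p>b</p></doc>`, the leading paragraph is passed through unwrapped, and each heading opens a `<div>` that holds the paragraphs following it.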
There are certain elements that we don't really care about, but that
force us to regroup everything from the next sibling on.
@see grouping in construction of headline outline.
Grouping consecutive elements that belong together
We are now working on a group of all elements inside some group bounded by
headings. These need to be further split up into smaller groups for figures,
lists, etc., and into individual groups for simple paragraphs.
<xsl:template match="w:sdt|w:tbl|w:p" mode="inSectionGroup">
  <!--
    We are looking for:
    - Lists -> 1
    - Table of Contents -> 2
    - Figures -> 3
    - Lines -> 4
    - Captions -> 5
    - Front matter -> 6
    Anything else is assigned a number of position()+100. This should be
    sufficient even if we find lots more things to group.
  -->
  <xsl:for-each-group select="current-group()"
      group-adjacent="if (tei:is-list(.)) then 1
                      else if (tei:is-toc(.)) then 2
                      else if (tei:is-figure(.)) then 3
                      else if (tei:is-line(.)) then 4
                      else if (tei:is-caption(.)) then 5
                      else if (tei:is-front(.)) then 6
                      else position() + 100">
    <!-- For each defined grouping call a specific template. If there is no
         grouping defined, apply templates with mode paragraph -->
    <xsl:choose>
      <xsl:when test="current-grouping-key()=1">
        <xsl:call-template name="listSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=2">
        <xsl:call-template name="tocSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=3">
        <xsl:call-template name="figureSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=4">
        <xsl:call-template name="lineSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=5">
        <xsl:call-template name="captionSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=6">
        <xsl:call-template name="frontSection"/>
      </xsl:when>
      <!-- it is not a defined grouping .. apply templates -->
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="paragraph"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>
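The `position() + 100` key is what keeps ordinary paragraphs apart: each one gets a unique grouping key, so `group-adjacent` never merges two of them, while consecutive elements of a recognized kind share a small fixed key and end up in one group. A minimal sketch of the idea, assuming a hypothetical flat `<doc>` with `<item>` and `<p>` children in place of the `tei:is-list()` test on `w:p` elements:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <xsl:template match="/doc">
    <body>
      <!-- Key 1 for list items; a unique key (position + 100)
           for everything else, so non-list elements stay alone. -->
      <xsl:for-each-group select="*"
          group-adjacent="if (self::item) then 1 else position() + 100">
        <xsl:choose>
          <!-- A run of adjacent items becomes one list -->
          <xsl:when test="current-grouping-key() = 1">
            <list>
              <xsl:for-each select="current-group()">
                <item><xsl:value-of select="."/></item>
              </xsl:for-each>
            </list>
          </xsl:when>
          <!-- Singleton group: handle the element individually -->
          <xsl:otherwise>
            <p><xsl:value-of select="."/></p>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </body>
  </xsl:template>

</xsl:stylesheet>
```

For `<doc><p>a</p><item>x</item><item>y</item><p>b</p><p>c</p></doc>` the keys are 101, 1, 1, 104 and 105, which yields four groups: one paragraph, one list with two items, and two further single paragraphs.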
Looks through the document to find forme work related sections and
creates an <fw> element for each of them. These include
running headers and footers. The corresponding elements in OOXML are w:headerReference
and w:footerReference. These elements only hold a reference to a header or
footer definition file; the reference itself is resolved in the file word/_rels/document.xml.rels.
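The lookup described above can be sketched as follows. This is not the stylesheet's own extract-forme-work template, which is not shown in this section. It only resolves each reference to its target file name and does not convert the header or footer content. The default value of the `relsFile` parameter, the wrapping `<div>`, and the `type` values are assumptions made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:rel="http://schemas.openxmlformats.org/package/2006/relationships"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="w r rel">

  <!-- Assumption: location of the relationships file. A relative
       string is resolved against the stylesheet's base URI. -->
  <xsl:param name="relsFile" select="'word/_rels/document.xml.rels'"/>

  <xsl:template match="/">
    <xsl:variable name="rels" select="document($relsFile)"/>
    <div>
      <xsl:for-each select="//w:headerReference | //w:footerReference">
        <!-- The r:id attribute points at a Relationship entry -->
        <xsl:variable name="rid" select="@r:id"/>
        <fw type="{if (self::w:headerReference) then 'header' else 'footer'}">
          <!-- Emit the target file name, e.g. a header part in word/ -->
          <xsl:value-of select="$rels//rel:Relationship[@Id = $rid]/@Target"/>
        </fw>
      </xsl:for-each>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

The same `rel:Relationship[@Id=$rid]/@Target` lookup is used by the w:hyperlink template in this stylesheet to resolve link targets.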
<xsl:template match="w:hyperlink">
  <!-- hyperlinks that do not contain any children should *probably* be
       omitted as in Word they result in nothing visible at all -->
  <xsl:if test="child::node()">
    <xsl:variable name="target">
      <xsl:variable name="rid" select="@r:id"/>
      <xsl:choose>
        <xsl:when test="ancestor::w:endnote">
          <xsl:value-of select="document(concat($wordDirectory,'/word/_rels/endnotes.xml.rels'))//rel:Relationship[@Id=$rid]/@Target"/>
        </xsl:when>
        <xsl:when test="ancestor::w:footnote">
          <xsl:value-of select="document(concat($wordDirectory,'/word/_rels/footnotes.xml.rels'))//rel:Relationship[@Id=$rid]/@Target"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="document($relsDoc)//rel:Relationship[@Id=$rid]/@Target"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:variable name="anchor" select="@w:anchor"/>
    <ref target="{string-join(($target, $anchor), '#')}">
      <xsl:apply-templates/>
    </ref>
  </xsl:if>
</xsl:template>
<xsl:template name="create-tei-header">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title><xsl:call-template name="getDocTitle"/></title>
        <author><xsl:call-template name="getDocAuthor"/></author>
      </titleStmt>
      <editionStmt>
        <edition>
          <date><xsl:call-template name="getDocDate"/></date>
        </edition>
      </editionStmt>
      <publicationStmt>
        <p>unknown</p>
      </publicationStmt>
      <sourceDesc>
        <p>Converted from a Word document</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <xsl:call-template name="generateAppInfo"/>
    </encodingDesc>
    <revisionDesc>
      <listChange>
        <change>
          <date><xsl:value-of select="tei:whatsTheDate()"/></date>
          <name><xsl:call-template name="getDocAuthor"/></name>
        </change>
      </listChange>
    </revisionDesc>
  </teiHeader>
</xsl:template>