Hello all,
Several months ago I asked for help with the following task:
When reading WordprocessingML (from Word 2003), I get text runs with
inline styles attached as leaf nodes, like this:
<w:r>
<w:rPr>
<w:i/>
<w:b/>
</w:rPr>
<w:t>This is text in bold and italic.</w:t>
</w:r>
In my output, however, the inline styles should nest, and moreover, nest
in a particular order:
<run><b><i>This is text in bold and italic.</i></b></run>
With the valuable help from the list, especially from David and Wendell,
I managed to craft an XSLT 2.0 stylesheet module that does the job quite
well; I have attached a demo version of it below (I can post the full,
richly commented version if someone is interested), together with a
sample input file.
Now I need to enhance it a little to cater for cases where a single
inline style has several different "indicators", which currently
interfere with each other. An example is the superscript style, which
shows up as (at least) one of these three child element sequences
within w:rPr:
A) <w:vertAlign w:val="superscript"/>
B) <w:position w:val="6"/>
C) <w:position w:val="6"/><w:vertAlign w:val="superscript"/>
While A) works well, B) and C) receive multiple <sup></sup> containers;
see the result of applying the sample XSL to the sample XML. I know I
need to stop considering each child of w:rPr separately and instead look
for sequences (or rather something like unordered node sets?) of them.
So I am facing two questions:
1) Which is the best way to replace the one-by-one comparison in
<xsl:when test="some $style_repr in w:rPr/*
satisfies
deep-equal($style_repr, $current_style_wordml_repr)">
with an algorithm that is capable of comparing node sets?
2) I suppose I will have to be able to delete the child elements of
w:rPr that were already matched, to prevent cases A and B above from
matching when case C has already matched. (Currently I do without this
because I assumed the matching patterns to be mutually exclusive.) I
think it would be easiest to turn the w:r instance, which is now
accessed as the context node, into a parameter so that it can be
modified. Is there a more elegant way?
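A minimal, untested sketch of one possible direction for both questions.
It assumes that the lookup table is restructured to hold one entry per
target style, with each alternative set of indicators wrapped in a
lookup:alt element (a hypothetical name, not part of the stylesheet
below). A style then counts as active if, for at least one alternative,
every indicator has a deep-equal counterpart among the children of
w:rPr, in any order. Since each style is visited only once, nothing has
to be deleted from the run.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:lookup="http://xmlns.srz.de/yforkl/xslt/lookup"
    exclude-result-prefixes="lookup w"
    version="2.0">

  <!-- Assumption: one entry per target style; each lookup:alt
       (hypothetical element) holds one complete set of indicators.
       The order of the entries still defines the nesting order. -->
  <lookup:wordml_styles_table>
    <lookup:wordml_phys_style_repr r_equiv="b">
      <lookup:alt><w:b/></lookup:alt>
    </lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="i">
      <lookup:alt><w:i/></lookup:alt>
    </lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <lookup:alt><w:vertAlign w:val="superscript"/></lookup:alt>
      <lookup:alt><w:position w:val="6"/></lookup:alt>
    </lookup:wordml_phys_style_repr>
  </lookup:wordml_styles_table>

  <xsl:template match="w:p">
    <sample>
      <xsl:apply-templates/>
    </sample>
  </xsl:template>

  <xsl:template match="w:r">
    <xsl:call-template name="add_style">
      <xsl:with-param name="available_styles_sequence"
          select="document('')/xsl:stylesheet/
                  lookup:wordml_styles_table/
                  lookup:wordml_phys_style_repr"/>
    </xsl:call-template>
  </xsl:template>

  <!-- The context node is the w:r throughout the recursion -->
  <xsl:template name="add_style">
    <xsl:param name="available_styles_sequence"/>
    <xsl:variable name="style" select="$available_styles_sequence[1]"/>
    <xsl:choose>
      <!-- No styles left to check: emit the text -->
      <xsl:when test="empty($style)">
        <xsl:apply-templates select="w:t"/>
      </xsl:when>
      <!-- Unordered subset test: some alternative has all of its
           indicators present among the children of w:rPr -->
      <xsl:when test="some $alt in $style/lookup:alt satisfies
                      (every $ind in $alt/* satisfies
                       (some $prop in w:rPr/* satisfies
                        deep-equal($prop, $ind)))">
        <xsl:element name="{$style/@r_equiv}">
          <xsl:call-template name="add_style">
            <xsl:with-param name="available_styles_sequence"
                select="remove($available_styles_sequence, 1)"/>
          </xsl:call-template>
        </xsl:element>
      </xsl:when>
      <!-- Style not active: continue with the next one -->
      <xsl:otherwise>
        <xsl:call-template name="add_style">
          <xsl:with-param name="available_styles_sequence"
              select="remove($available_styles_sequence, 1)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

With this table, cases A, B and C each satisfy at least one lookup:alt
of the single "sup" entry, so each run should receive at most one sup
element. An alternative that requires several indicators at once would
list them together inside one lookup:alt. Note that an empty lookup:alt
would match every run, because "every" over an empty sequence is true.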
Yves
===== wordml_phys_run_styles.xsl =====
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:lookup="http://xmlns.srz.de/yforkl/xslt/lookup"
exclude-result-prefixes="lookup w"
version="2.0">
<lookup:wordml_styles_table>
<lookup:wordml_phys_style_repr r_equiv="b">
<w:b/>
</lookup:wordml_phys_style_repr>
<lookup:wordml_phys_style_repr r_equiv="i">
<w:i/>
</lookup:wordml_phys_style_repr>
<!-- Uncomment this to try a naive approach to recognize superscript by two
child elements at once -->
<!--
<lookup:wordml_phys_style_repr r_equiv="sup">
<w:position w:val="6"/>
<w:vertAlign w:val="superscript"/>
</lookup:wordml_phys_style_repr>
-->
<lookup:wordml_phys_style_repr r_equiv="sup">
<w:vertAlign w:val="superscript"/>
</lookup:wordml_phys_style_repr>
<lookup:wordml_phys_style_repr r_equiv="sup">
<w:position w:val="6"/>
</lookup:wordml_phys_style_repr>
</lookup:wordml_styles_table>
<xsl:template match="w:p">
<sample>
<xsl:apply-templates/>
</sample>
</xsl:template>
<xsl:template match="w:r">
<xsl:call-template name="convert_phys_run_styles"/>
</xsl:template>
<!-- Convert the physical run styles and text of w:r (the context node)
into nested inline elements -->
<xsl:template name="convert_phys_run_styles">
<xsl:call-template name="add_style">
<!-- Pass a sequence of the representations of all possible physical
run styles; the order of its items reflects the style nesting
hierarchy defined by the target structure. The context node is the
w:r, which may have physical styles applied. -->
<xsl:with-param name="available_styles_sequence"
select="
document('')/
xsl:stylesheet/
lookup:wordml_styles_table/
lookup:wordml_phys_style_repr"/>
</xsl:call-template>
</xsl:template>
<!-- Add element for current style from hierarchy, if it is active -->
<xsl:template name="add_style">
<xsl:param name="available_styles_sequence"/>
<xsl:choose>
<xsl:when test="empty($available_styles_sequence)">
<!-- (Children: usually either w:t or w:sym, holding text only) -->
<xsl:apply-templates select="w:t"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="current_style_wordml"
select="$available_styles_sequence[1]"/>
<xsl:variable name="current_style_wordml_repr"
select="$current_style_wordml/*[1]"/>
<xsl:choose>
<!-- Try to find the child of w:rPr that is the WordML representation
of the run style currently being looked for. An existentially
quantified comparison is used because the order of w:rPr's
children is free, whereas deep-equal compares its two arguments
item by item in order and so cannot be applied to the unordered
children as a whole. If the current style matches, add the
style's element equivalent around the inner styles and text. -->
<!-- ### Limitation: does not support indicators composed of several
nodes with particular relationships (e.g. a specific order, or
configurations of nodes to be interpreted specially), i.e. only
atomic recognition of single nodes is possible -->
<xsl:when test="some $style_repr in w:rPr/*
satisfies
deep-equal($style_repr, $current_style_wordml_repr)">
<xsl:element name="{$current_style_wordml/@r_equiv}">
<xsl:call-template name="add_style">
<xsl:with-param name="available_styles_sequence"
select="remove($available_styles_sequence, 1)"/>
</xsl:call-template>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="add_style">
<xsl:with-param name="available_styles_sequence"
select="remove($available_styles_sequence, 1)"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
===== wordml_phys_run_styles.xml =====
<?xml version="1.0" encoding="ISO-8859-1"?>
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<w:r>
<w:rPr>
<w:i/>
<w:vertAlign w:val="superscript"/>
</w:rPr>
<w:t>This italic + superscript is always fine</w:t>
</w:r>
<w:r>
<w:rPr/>
<w:t> but </w:t>
</w:r>
<w:r>
<w:rPr>
<w:i/>
<w:position w:val="6"/>
</w:rPr>
<w:t>that italic + superscript has maybe one "sup" container too
much</w:t>
</w:r>
<w:r>
<w:rPr/>
<w:t> while </w:t>
</w:r>
<w:r>
<w:rPr>
<w:i/>
<w:position w:val="6"/>
<w:vertAlign w:val="superscript"/>
</w:rPr>
<w:t>this italic + superscript has either a double or triple "sup"
container!</w:t>
</w:r>
</w:p>