Subject: [xsl] Applying Streaming To DITA Processing: Looking for Guidance
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 Oct 2014 14:16:08 -0000
In the context of DITA processing, where you have a "map" document that links to potentially many 1000s of topic documents in order to define a complete publication, I have an existing XSLT 2 process that processes the entire data set, walking the map and processing each referenced document in turn, in order to construct a single XML structure that captures all the information needed to do numbering (or any similar publication-wide process, like index generation). This "data collection" process has the necessary effect of parsing every document ultimately referenced from the map, which can have a severe memory cost for large publications. This structure is then provided as a tunnel parameter to the next phase of processing, where the final deliverable result is generated (e.g., HTML pages for a Web site, EPUB, etc.).

I know that with Saxon I could save memory today by discarding documents after the first phase, but then I'd have to reparse them, and that can incur a steep cost as well. And that would be a Saxon-specific optimization; I'm trying to avoid being tied to a specific XSLT implementation.

My question with regard to streaming in this use case: Can streaming help, either with overall processing efficiency or with memory usage? Where would I go today, or in the near future, to gain the understanding of streaming required to answer these questions (other than the XSLT 3 spec itself, obviously)?

Because my data collection process is copying data to a new result, I'm pretty sure it's inherently streamable: I'm just processing documents in an order determined by a normal depth-first tree walk of the map structure (a hierarchy of hyperlinks to topics) and grabbing relevant data (e.g., division titles, figure titles, index entries, etc.). If this was all I was doing, then for sure streaming would help memory usage.
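For what it's worth, the kind of phase-1 collection described above could be sketched roughly as follows in XSLT 3.0, using a streamable mode and xsl:source-document. This is a minimal illustration under assumed element names (topicref, title, indexterm, entry), not anyone's actual code, and I haven't verified it against a streaming processor:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

  <!-- Streamable collection mode: elements not matched below are
       skipped, but their descendants are still visited. -->
  <xsl:mode name="collect" streamable="yes" on-no-match="shallow-skip"/>

  <!-- The map itself is small, so it is processed normally (tree in
       memory); each referenced topic is opened in streaming mode, so
       only one topic's "bursts" are ever held at a time. -->
  <xsl:template match="topicref[@href]">
    <xsl:source-document streamable="yes"
                         href="{resolve-uri(@href, base-uri(.))}">
      <xsl:apply-templates mode="collect"/>
    </xsl:source-document>
    <!-- Continue the depth-first walk into nested topicrefs. -->
    <xsl:apply-templates select="topicref"/>
  </xsl:template>

  <!-- Grab only the small subtrees needed for numbering and index
       generation; copy-of() yields a grounded copy, so the rest of the
       streamed topic is never retained. -->
  <xsl:template match="title | indexterm" mode="collect">
    <entry><xsl:copy-of select="."/></entry>
  </xsl:template>

</xsl:stylesheet>
```

The key point is that each template in the streamable mode makes a single downward-consuming selection (the copy-of), which is what the streamability rules require.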
But because I must then process each topic again to generate the final result, and that process is not directly streamable, would streaming the first phase help overall?

Taken a step further: are there implementation techniques I could apply in order to make the second phase streamable (e.g., collecting the information needed to render cross references without having to fetch the target elements), and could I expect that to then provide enough performance improvement to justify the implementation cost? The current code is both mature and relatively naive in its implementation. Reworking it to be streamable could entail a significant refactoring (maybe; that's part of what I'm trying to determine).

The actual data processing cost is more or less fixed, so unless streaming makes the XSLT operations faster, I wouldn't expect streaming by itself to reduce processing time. However, the primary concern in this use case is memory usage: currently, memory required is proportional to the number of topics in the publication, whereas it could be limited to simply the largest topic plus the size of the collected data itself (which is obviously much smaller than the size of the topics, as it includes only the minimum data needed to enable numbering and such).

Thanks,

Eliot

--
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com
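On the cross-reference idea specifically: one conceivable approach is to have phase 1 also record, for every potential xref target, the already-computed label text keyed by its reference address, so that phase 2 can render an xref from a lookup alone. A loose sketch (the $collected variable, the target/entry structure, and all names are assumptions for illustration):

```xml
<!-- Build a key/label map from the phase-1 collected structure.
     Assumes target elements like <target key="topic.dita#fig-1"
     label="Figure 3"/> were emitted during collection. -->
<xsl:variable name="xref-targets" as="map(xs:string, xs:string)">
  <xsl:map>
    <xsl:for-each select="$collected//target">
      <xsl:map-entry key="string(@key)" select="string(@label)"/>
    </xsl:for-each>
  </xsl:map>
</xsl:variable>

<!-- Phase 2: render an xref without ever loading the target topic. -->
<xsl:template match="xref[@href]">
  <a href="{@href}">
    <xsl:value-of select="($xref-targets(string(@href)),
                           '[unresolved reference]')[1]"/>
  </a>
</xsl:template>
```

Because the template matching xref consumes only its own attributes and the prebuilt map (a grounded, in-memory value), it makes no second downward selection into other documents, which is what would otherwise block streamability in phase 2.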