Re: [xsl] Flattening characters to plain latin

Subject: Re: [xsl] Flattening characters to plain latin
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 16 Feb 2007 14:05:47 +0100
Colin Paul Adams wrote:
    >> codepoints-to-string(string-to-codepoints(normalize-unicode($in,
    >> 'NFKD'))[.  lt 127])


No. Not unless you correct the error in it first.


What error?

The following:
codepoints-to-string(string-to-codepoints(normalize-unicode('ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ', 'NFKD'))[. le 127])


returns "AAAAAACEEEEIIIINOOOOO"

It misses the Ð and Æ, but I do not know the NFKD algorithm behind normalize-unicode well enough to say whether that is an error or not. In addition, I changed lt to le, although I was under the impression that codepoint 127 was not part of Latin-1. The code itself was correct, but the OP's definition of "plain latin" perhaps needs some clarification.
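
For anyone who wants to check this outside an XSLT processor, here is a minimal Python sketch of the same expression. It is an analogue, not the XPath itself, and it assumes the input is the Latin-1 range U+00C0 through U+00D6 (À to Ö). The names flatten_ascii, sample and undecomposed are made up for the illustration. It suggests the two dropped characters are simply the ones that have no NFKD decomposition, so they stay above 127 and the filter throws them away.

```python
import unicodedata


def flatten_ascii(text: str) -> str:
    """Decompose with NFKD, then keep only code points up to 127."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) <= 127)


# Assumed input: U+00C0 (A with grave) through U+00D6 (O with diaeresis)
sample = "".join(chr(cp) for cp in range(0xC0, 0xD7))

# Characters that NFKD leaves unchanged have no decomposition at all,
# so no plain base letter is left behind for the filter to keep
undecomposed = [ch for ch in sample if unicodedata.normalize("NFKD", ch) == ch]

print(flatten_ascii(sample))
print(undecomposed)
```

With that input the only undecomposed characters are U+00C6 (Æ) and U+00D0 (Ð), which would make the behaviour a property of the Unicode data rather than an error in the expression.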


But here's one that removes all combining marks (the diacritics), but leaves alone the other symbols, like 0, . and ', and also keeps the missing Æ and Ð:


codepoints-to-string(
   string-to-codepoints(
       normalize-unicode('ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ0.''', 'NFKD'))
   [replace(codepoints-to-string(.), '[\p{M}]', '')])

This returns "AAAAAAÆCEEEEIIIIÐNOOOOO0.'"

but it is not even close to as pretty as Michael's! Note the double codepoints-to-string (which makes it u-u-u-u-gly!). The alternative, a single replace on the codepoints-to-string of the whole string-to-codepoints + normalize-unicode result, would automatically normalize the results back before the regular expression can do its work. But like I said, it is u-u-u-u-gly!
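
For comparison, a minimal Python sketch of the same idea: decompose with NFKD, then drop every character whose general category is a mark (the \p{M} class). The input is again assumed to be U+00C0 through U+00D6 followed by 0, . and ', and strip_marks is a name made up for the illustration. Here the filter runs over the whole decomposed string in one pass; this only shows what Python does and says nothing about how a given XSLT processor treats the intermediate string.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop characters in the Unicode Mark categories."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )


# Assumed input: U+00C0 through U+00D6, then a digit, a full stop and an apostrophe
sample = "".join(chr(cp) for cp in range(0xC0, 0xD7)) + "0.'"
result = strip_marks(sample)
print(result)
```

Characters without a decomposition pass through untouched because they are letters, not marks, and the digit and punctuation survive for the same reason.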

-- Abel

PS: hope the mailer does not mess too much with the high Latin-1 characters....
