Hi List people,
I have the following requirement: parsing files in a
non-Unicode encoding that contain numeric escape references to
codepoints in a non-Unicode codepage. Let me explain with a well-known
example, the encoding of a URI. The following is an illegal URI:
http://somesite.com/request?city=%C5rhus
which means: http://somesite.com/request?city=Århus (the first
character after city= is a Latin Capital Letter A With Ring Above,
U+00C5), and should have been encoded as
http://somesite.com/request?city=%C3%85rhus to be a correct URI (%C3%85
represents the UTF-8 byte sequence for U+00C5).
In this example, the wrongly encoded URI uses '%C5' to mean the
codepoint at position 197 (hex C5) in the ISO-8859-1 table. Because
ISO-8859-1 coincides with the first 256 Unicode codepoints, this
translates directly to U+00C5 by using codepoints-to-string (plus some
additional work for the hex-to-decimal conversion).
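To make the ISO-8859-1 case concrete outside XSLT, here is a minimal
Python sketch (not an XPath solution; the variable names are only
illustrative). It assumes the escaped byte is meant as ISO-8859-1 and
uses the standard urllib.parse functions to decode it and to re-encode
it as UTF-8:

```python
from urllib.parse import quote, unquote

# The wrongly encoded query value from the example URI.
raw = "%C5rhus"

# Interpret the escaped byte 0xC5 as ISO-8859-1 instead of the default UTF-8.
city = unquote(raw, encoding="iso-8859-1")

# Re-encode the same text as UTF-8 percent escapes, as a correct URI expects.
fixed = quote(city)

print(city, fixed)
```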
Alas, I have plenty of this stuff, and most of it is not URIs but
arbitrary data, and most of the encodings are not even ISO-8859-1, but
some other, less common encoding, like ISO-8859-7 (Greek). If the
example above were encoded in ISO-8859-7, the C5 would represent the
letter Epsilon, U+0395.
My problem is with strings containing hex or decimal numbers pointing to
a codepoint in an arbitrary encoding (the encoding is known beforehand).
This may look as follows for the Greek encoding:
[C5]psilon
%C5psilon
<C5>psilon
and I would like to translate 'C5' using the 'ISO-8859-7' encoding by a
function similar to codepoints-to-string(), but now the codepoint
represents a code in some other codepage than Unicode (here: ISO-8859-7).
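As an illustration of the behaviour I am after, here is a rough Python
sketch (not XSLT). The function name decode_escapes and the regular
expression are my own assumptions, and it only handles two-digit hex
escapes in a single-byte encoding, in the three notations shown above:

```python
import re

# Matches [C5], %C5 or <C5>: exactly two hex digits in one of three notations.
ESCAPE = re.compile(r"\[([0-9A-Fa-f]{2})\]|%([0-9A-Fa-f]{2})|<([0-9A-Fa-f]{2})>")


def decode_escapes(text, encoding):
    """Replace each hex escape by the character it denotes in `encoding`.

    Assumes a single-byte encoding such as ISO-8859-7. A byte value that
    is undefined in that encoding raises UnicodeDecodeError.
    """
    def replace(match):
        # Exactly one of the three groups matched; pick its hex digits.
        hex_digits = next(g for g in match.groups() if g is not None)
        return bytes([int(hex_digits, 16)]).decode(encoding)

    return ESCAPE.sub(replace, text)


for sample in ("[C5]psilon", "%C5psilon", "<C5>psilon"):
    print(decode_escapes(sample, "iso-8859-7"))
```

With "iso-8859-1" instead of "iso-8859-7", the same escapes would
resolve to U+00C5 rather than U+0395.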
A lot of functions in XSLT, when dealing with serialization or reading
sources, can deal with encodings, but not codepoints-to-string, which
only deals with Unicode. Is anyone aware of the reasons for this?
Why is there no function defined like:
fn:string-to-codepoints($arg as xs:string?, $enc as xs:string?) as
xs:integer*
where $enc is the encoding? Is this because encoding handling is
part of XSLT, whereas XPath only deals with Unicode? My guess would
be that XPath only sees strings of Unicode characters and is unaware of
any encoding issues, at any time.
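To show what such an encoding-aware pair of functions would compute,
here is a small Python sketch. The names mirror the XPath functions but
are hypothetical, and the sketch assumes a single-byte encoding, so
that one character corresponds to exactly one code:

```python
def string_to_codepoints(arg, enc):
    """Return the code of each character of `arg` in the encoding `enc`.

    Assumes `enc` is a single-byte encoding (e.g. ISO-8859-7); with a
    multi-byte encoding this would return bytes, not character codes.
    """
    return list(arg.encode(enc))


def codepoints_to_string(codes, enc):
    """Inverse direction: build a string from codes in the encoding `enc`."""
    return bytes(codes).decode(enc)


# The same code, 197 (hex C5), names a different character per encoding.
print(codepoints_to_string([0xC5], "iso-8859-1"))
print(codepoints_to_string([0xC5], "iso-8859-7"))
```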
Apart from these questions, is anybody aware of a resolution to my
problem? Most likely I need an extension function, am I right?
Thanks for any thoughts on this,
Cheers,
-- Abel