Hi Roger,
see below
Roger L. Cauvin wrote:
> I have a program that is receiving text-only e-mails and logging the
> messages to XML. For various reasons (including troubleshooting), I would
> like to log the content of the e-mails exactly. It sounds like that's
> simply not possible in XML, at least to the extent that "text-only" can
> include characters not allowed in XML.
Text-only is very much misunderstood, simply because there is no such
thing as text-only. Is it IBM-437 (that is, original DOS)? Is it ASCII?
Some people call it plain old ASCII when it is in fact windows-1252, and
many people don't know the differences between windows-1252 and
ISO-8859-1. If text-only means UTF-8, it will contain a lot of
"binary"-looking bytes when viewed as windows-1252. If you have a mail
from Gmail, it will be sent in UTF-7 (yes, I know, strange), which is yet
another binary-looking format for text-only. And last but not least,
text-only in an EBCDIC encoding will look like a mess on Windows.

Luckily, XML can handle all these encodings, although parsers are only
required to support UTF-8 and UTF-16. Saxon, in my experience, handles
EBCDIC well, but not UTF-7 (the blame for that lies with Sun, which has
refused for ages to include UTF-7 in Java).
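To make the "binary"-looking bytes concrete, here is a minimal sketch in Python (my choice for illustration only, nothing in this thread uses it). It encodes one smart quote as UTF-8 and then views the same bytes under a few other encodings; the helper name view_as is made up.

```python
# Encode one character as UTF-8, then decode the same bytes the "wrong" way.
quote = "\u201c"  # left double smart quote

utf8_bytes = quote.encode("utf-8")

def view_as(data: bytes, encoding: str) -> str:
    """Decode data under the given encoding, replacing undecodable bytes."""
    return data.decode(encoding, errors="replace")

for enc in ("utf-8", "cp1252", "latin-1", "cp437"):
    # repr() keeps control characters visible instead of printing them raw
    print(f"{enc:8} {view_as(utf8_bytes, enc)!r}")
```

Only the first line shows the quote; the other lines show the same three bytes as unrelated characters.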
> I inserted the ISO 8859-1 encoding declaration myself. Apparently, Saxon
> 6.3 doesn't support windows-1252 encoding. Saxon 8.9J, which I just now
> installed, does appear to support that encoding. However, it still
> (correctly) flags the U+18 character as illegal.
That does not depend on Saxon but on the Java version you use. By the
way, why would you want to use an age-old version of Saxon? Saxon 8.9 can
do XML 1.1, which can represent character U+18 (as the character
reference &#x18;, not as a raw byte).
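For reference, the difference between the two XML versions can be written down as a small Python sketch. The ranges follow the Char and RestrictedChar productions of the XML 1.0 and XML 1.1 recommendations as I read them; the function names are my own.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def is_xml11_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.1 (only NUL is excluded
    from the low range)."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def needs_reference_in_xml11(cp: int) -> bool:
    """True if cp is a RestrictedChar in XML 1.1, i.e. it may only appear
    as a character reference such as &#x18;."""
    return (0x1 <= cp <= 0x8 or 0xB <= cp <= 0xC or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)

print(is_xml10_char(0x18), is_xml11_char(0x18), needs_reference_in_xml11(0x18))
```

For U+18 this prints False True True: illegal in XML 1.0, legal in XML 1.1, but only as a character reference.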
But you should work out which encoding your email program is using. It
may store the file in the same encoding as it was received, but it is
more likely that the email program transforms it into some other format.
If you read your email through the interface of the email program (e.g.,
Thunderbird), you see the text after the program has decoded it. But when
you view your email in its raw form, you'll have to find out what the
email program did to it. In the case of Thunderbird, I believe it stores
everything in one file and encodes it as UTF-8, but I am not sure (and
you probably use a different mailer). Maybe Thunderbird even stores
different text encodings in one file...
More about U+18
---------------
Unfortunately, without knowing the byte sequence of that character and
without a binary view of your whole file, it is quite hard to determine
what the real encoding is. As David already suggested, you should check
the encoding of your file.
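If you have no hex viewer at hand, a few lines of Python give you the bytes around each 18. This is only a sketch: the function name hex_context and the sample bytes are made up, and for a real file you would read the bytes yourself in binary mode.

```python
def hex_context(data: bytes, target: int = 0x18, width: int = 8):
    """Yield (offset, hex string) for the bytes around each occurrence of target."""
    for offset, byte in enumerate(data):
        if byte == target:
            start = max(0, offset - width)
            chunk = data[start:offset + width + 1]
            yield offset, chunk.hex(" ")

# Made-up sample; replace with the raw bytes of the logged message.
sample = b"He said \x20\x18hello\x20\x19 to me"
for offset, context in hex_context(sample):
    print(f"byte 18 at offset {offset}: {context}")
```

The bytes directly before and after the 18 are usually what gives the encoding away.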
Furthermore, it seems you are dealing with a lot of text. If you want
more control over which encoding is used, you can try
unparsed-text($url, $encoding) in XSLT 2.0. Note that you must then
remove the XML declaration, because the spec states that the encoding in
the declaration takes precedence over the user-specified encoding.
If you want to check a bunch of encodings all at once and see if one
fits, you can look up the list of supported encodings at Sun's website.
Make it into a sequence, e.g.

    <xsl:variable name="encodings" select="
        'utf-8', 'utf-16', 'windows-1252', 'IBM500', 'Big5', 'utf-16BE' " />

and you can loop through all possibilities using:

    <xsl:value-of select=" for $enc in $encodings return
        ($enc, unparsed-text-available($url, $enc)) " separator=" " />
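The same "try them all" idea outside XSLT, as a Python sketch. The codec names are Python's spellings of the encodings above, and the sample bytes are a stand-in I made up, not your data.

```python
def try_encodings(data: bytes, encodings):
    """Return {encoding: decoded text, or None if decoding fails}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers codec names this Python does not know
            results[enc] = None
    return results

candidates = ["utf-8", "utf-16", "cp1252", "cp500", "big5", "utf-16-be"]
sample = "\u201cquoted\u201d".encode("cp1252")  # stand-in for the real bytes
for enc, text in try_encodings(sample, candidates).items():
    print(f"{enc:10} {'ok' if text is not None else 'fails'}")
```

Note that "ok" only means the bytes are legal in that encoding, not that the text makes sense. Single-byte encodings such as cp500 accept almost anything, so you still have to look at the result.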
But to me, it sounds like you have encountered a mail in the lesser-used
UTF-7 encoding. You were talking about the Windows quote. Let's examine
that, assuming you are correct that the quote appears where you found
U+18. (It is not really U+18, but a byte with hex value 18 that does not
translate to the correct character in the encoding you guessed the file
was in.) I also assume that this 18h is part of a longer byte sequence in
which the smart quote is encoded. The smart quote, I hope, is “ (U+201C,
or &#x201C; if the browser/mailer screws it up), which is the one MS Word
uses.
You read the file using 'ISO-8859-1'. That means that the first 128 byte
values equal their counterparts in the Unicode table (i.e., bytes 00 to
7F are the same as U+0 to U+7F), so a byte 18 in the stream is reported
as U+18.
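The byte-to-code-point mapping of ISO-8859-1 is easy to check, for instance with this small Python sketch (again only an illustration; the variable names are mine). In fact it holds for all 256 byte values, not only the ASCII half.

```python
# Decode every possible byte value under ISO-8859-1 and compare each
# resulting code point with the byte it came from.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
one_to_one = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

print(one_to_one)
print(repr(b"\x18".decode("iso-8859-1")))
```

So whatever the 18 really belonged to, reading it as ISO-8859-1 can only ever give you U+18.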
Now, let's see what the smart quote should really look like in certain
encodings that I think are likely to be encountered in mail (all bytes
are in hex):
windows-1252:  93
iso-8859-1:    3F (question mark, i.e., cannot be represented)
Big5:          A1A7
UTF-8:         E2809C
UTF-16:        201C
UTF-16BE:      201C
UTF-16LE:      1C20
UTF-7:         +IBw- (ASCII text, not the hex representation; that
               would be 2B4942772D)
Shift-JIS:     8167
IBM500:        3F (wrongly represented by the serializer: in IBM
               encodings 3F is SUB, a control code)
GB 18030:      A1B0
MacRoman:      D2 (the Mac codepage actually has a 'smart quote')
EUC-JP:        A1C8
I got this information by combining saxon:serialize,
saxon:string-to-hexBinary and saxon:string-to-base64Binary (the last one
to build the UTF-7 form, which Java does not support and which is not
part of the Unicode standard). As you can see, none of these encodings
uses byte 18 anywhere in the smart quote, and I know of no encoding that
uses byte 18 to represent a high character.
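If you want to redo such a list without Saxon, a Python sketch along these lines should do. The codec names are Python's nearest equivalents of the encodings above (an assumption on my part), and the results for characters that cannot be represented may differ from what a Java serializer emits, for instance for IBM500.

```python
QUOTE = "\u201c"  # left double smart quote
CODECS = ["cp1252", "iso-8859-1", "big5", "utf-8", "utf-16-be", "utf-16-le",
          "utf-7", "shift_jis", "cp500", "gb18030", "mac_roman", "euc_jp"]

def encode_hex(text: str, codec: str) -> str:
    """Hex of text in codec; unencodable characters become the codec's '?'."""
    return text.encode(codec, errors="replace").hex().upper()

table = {codec: encode_hex(QUOTE, codec) for codec in CODECS}
for codec, hex_bytes in table.items():
    print(f"{codec:12} {hex_bytes}")

# Does any of these encodings put a byte 18 into the encoded quote?
uses_18 = any(0x18 in QUOTE.encode(c, errors="replace") for c in CODECS)
print("byte 18 used:", uses_18)
```

The last line reports whether a byte 18 turned up in any of the encoded forms.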
The only situation I can think of in which a byte 18 legally appears in
the byte sequence of a character is a multi-byte encoding, in practice
UTF-16. For example, the left single smart quote U+2018 is 20 18 in
UTF-16BE and 18 20 in UTF-16LE, so the 18 can be either the first or the
second byte of a pair. Big5 and GB 18030 do not qualify, because their
trail bytes never go below 30.
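A quick way to see where a byte 18 can sit in a UTF-16 stream is to encode a character whose code point contains 18, such as the left single smart quote U+2018. This is a Python sketch with a made-up helper name; whether this is what actually happened in your file I cannot tell from here.

```python
def positions_of_byte(text: str, codec: str, target: int = 0x18):
    """Return the indexes within the encoded form where byte `target` occurs."""
    return [i for i, b in enumerate(text.encode(codec)) if b == target]

single_quote = "\u2018"  # left single smart quote
for codec in ("utf-16-be", "utf-16-le", "utf-8", "cp1252"):
    encoded = single_quote.encode(codec)
    print(f"{codec:10} {encoded.hex(' ').upper():10} "
          f"{positions_of_byte(single_quote, codec)}")
```

In UTF-16BE the 18 is the second byte (20 18), in UTF-16LE the first (18 20), and in UTF-8 and windows-1252 it does not occur at all. Read as ISO-8859-1, the bytes 20 18 would show up as a space followed by U+18.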
All this just for one character that is illegally encoded in the source?
Well, email programs, like I said, save their data in a variety of
formats. So either find out which encoding yours really uses, or set the
output options to XML 1.1 and use unparsed-text.
Hope this "little" story clarifies things a bit for you.
cheers,
-- Abel Braaksma