From: Jose Kahan (jose.kahan@w3.org)
Date: Thu Apr 10 2003 - 11:58:40 CDT
Hi folks,
I just committed my changes for upgrading hypermail to XHTML 1.0 Strict.
This is part of my work for adding WAI enhancements to hypermail. This
first commit only concerns the upgrade to XHTML. The next commits will
be for the WAI enhancements once they're ready.
Before doing the commit I added a tag "Before_XHTML" to the source
code so that, in case of problems, we can more easily find out what
went wrong.
I made three kinds of changes. Users who
have used HTML templates to customize their archives should upgrade
them to the XHTML syntax in order to have valid documents. Here's
a list of the most common changes I made:
1. Well-formed documents (respecting the XML syntax):
* All elements need to have an end tag.
<li>something
is now <li>something</li>
* Single elements like <br> become <br />
* All attributes need to have a value. If there was no value before,
they take the name of the attribute.
* The -- sequence is forbidden inside a comment.
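For template authors who want to check the well-formedness rules above
before upgrading, here is a minimal sketch in Python (not part of
hypermail). The function name is_well_formed and the <div> wrapper are
assumptions made for illustration. It only tests XML well-formedness, not
validity against the strict DTD, and named entities such as &nbsp; will be
rejected because no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML.

    The fragment is wrapped in a single <div> root (an illustrative
    choice) so that several sibling elements can be checked at once.
    """
    try:
        ET.fromstring(f"<div>{fragment}</div>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<ul><li>something</ul>",        # missing </li>
        "<ul><li>something</li></ul>",   # end tag present
        "line<br>",                      # empty element not closed
        "line<br />",                    # XHTML form
        "<input checked />",             # attribute without a value
        '<input checked="checked" />',   # value repeats the attribute name
        "<!-- a -- b -->",               # -- inside a comment
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

Running it over a template should flag the old HTML forms from the list
above while accepting their XHTML replacements.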
2. Valid XHTML documents (according to the strict DTD):
* <ul><li><ul><li>something</ul></ul> has become
<ul><li><ul><li>something</li></ul></li></ul>
* It's invalid to have an empty <ul>
<ul></ul>
has become
<ul><li style="display: none"></li></ul>
* The <u> (underline) tag has been deprecated. We were only using it
in tables. I removed it.
3. Charset problems (related to XHTML, HTML, and XML alike):
Many mail clients specify one kind of charset (often ISO-8859-1), but
include other characters belonging to other charsets (often WinLatin1)
in the message body (note that there is a way to combine charsets
in the headers that we already take care of). I noticed this problem
while validating the XHTML changes. In order to get this working, I
added some code so that WinLatin1 characters are encoded as the
respective Unicode entities.
For example, in an ISO-8859-1 document, the 0x80 character is invalid. I
assume that it belongs to WinLatin1 and convert it to &#8364; (€), which
is the equivalent entity. This character can now happily live inside
an ISO-8859-1 document.
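To illustrate the idea, here is a rough sketch in Python. It is not the C
code that went into hypermail, and the function name
winlatin1_to_entities is made up for this example. It assumes the input is
a byte string declared as ISO-8859-1, treats bytes in the 0x80-0x9F range
as WinLatin1 (Windows-1252), and replaces them with numeric character
references. Bytes that Windows-1252 leaves undefined become "?", which is
an arbitrary choice. Escaping of &, < and > is left out.

```python
def winlatin1_to_entities(raw: bytes) -> str:
    """Decode ISO-8859-1 bytes, turning stray WinLatin1 bytes into entities."""
    out = []
    for byte in raw:
        if 0x80 <= byte <= 0x9F:
            # This range holds no printable characters in ISO-8859-1, so
            # assume the sender really meant Windows-1252.
            try:
                char = bytes([byte]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (0x81, for instance) are undefined there too.
                out.append("?")
                continue
            out.append(f"&#{ord(char)};")
        else:
            # ISO-8859-1 maps every byte to the code point of the same value.
            out.append(chr(byte))
    return "".join(out)


if __name__ == "__main__":
    print(winlatin1_to_entities(b"price: \x80 5"))
```

The result contains only characters that are legal in an ISO-8859-1
document, so it can be written into the archive page unchanged.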
In order to achieve the above, I modified the API of some functions
so that we can pass the value of the charset and do the conversion
when needed.
When a message has no charset, I assumed it was ISO-8859-1.
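The fallback can be pictured with this small Python sketch, which uses the
standard email package rather than hypermail's own header parser. The
names message_charset and DEFAULT_CHARSET are hypothetical.

```python
from email import message_from_string

# Assumed default, matching the rule described above.
DEFAULT_CHARSET = "iso-8859-1"


def message_charset(raw_message: str) -> str:
    """Return the charset declared in Content-Type, or the default."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the lowercased charset parameter, or
    # failobj when the header or the parameter is missing.
    return msg.get_content_charset(failobj=DEFAULT_CHARSET)


if __name__ == "__main__":
    print(message_charset("Subject: hi\n\nbody"))
    print(message_charset('Content-Type: text/plain; charset="UTF-8"\n\nbody'))
```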
What won't work yet is an archive whose messages belong
to different charsets. Let's suppose that each subject is written with
a different charset. We can't say anymore that the subject.html index
belongs to ISO-8859-1 or something else without transcoding the
characters. I only added a transcoding for WinLatin1. A longer term
solution will be to move on to UTF-8, which solves those problems.
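As a sketch of what the UTF-8 route could look like (an illustration in
Python, not a design for hypermail), each subject is decoded with its own
declared charset and the whole index is then written out as UTF-8. The
function name build_subject_index and the (bytes, charset) pair layout are
assumptions.

```python
def build_subject_index(subjects: list[tuple[bytes, str]]) -> bytes:
    """Transcode (raw_subject, charset) pairs into one UTF-8 index body."""
    lines = []
    for raw, charset in subjects:
        # errors="replace" keeps the index buildable even when a message
        # lies about its charset; bad bytes become U+FFFD.
        lines.append(raw.decode(charset, errors="replace"))
    return "\n".join(lines).encode("utf-8")


if __name__ == "__main__":
    index = build_subject_index([
        ("café".encode("iso-8859-1"), "iso-8859-1"),
        ("łódź".encode("iso-8859-2"), "iso-8859-2"),
    ])
    print(index.decode("utf-8"))
```

Once everything is in UTF-8, a single charset declaration is enough for
the index, whatever charsets the individual messages used.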
The only drawback of having moved to XHTML is that when we parse a
generated XHTML message that mixes charsets in the wrong way, the
parser, if it's a valid XML one, will complain about invalid characters.
HTML parsers should complain too, but browsers have lots of fallbacks to
hide this error from users.
All in all, this shouldn't affect users of this XHTMLized hypermail,
as it's backwards compatible with HTML browsers. As long as your
documents are served with the text/html MIME type, things will work
as usual for you. There are some workarounds I can add if more
problems arise. For example, we can suppress the
XML prologue and just handle the document as an HTML one. Let's wait and
see how this turns out.
-------------
Some links:
-- The XHTML1.0 recommendation
http://www.w3.org/TR/xhtml1/
In particular, look at the "differences with HTML4" section:
http://www.w3.org/TR/xhtml1/#diffs
-- The WDG command-line validator (GPL, free)
http://www.htmlhelp.com/tools/validator/
I found it quite nifty and useful during the conversion
-- The W3C on-line validator
http://validator.w3.org/
Hope this is helpful! Send in your bug reports and comments.
-jose