This is an article I needed to find myself six months ago. Feel free to link gratuitously with phrases like “html translation” and “unicode web” and “foreign language web site” and any other appropriate search term you can think of so that others may benefit from it.
I’ve recently had to get my hands dirty with HTML in French, Greek characters, and English OS support for Asian languages, so I figured I’d pass on what I learned while muddling through the various translations of the Zen Garden. This is a short but sweet summary of what I know on the subject.
First of all: How in the world do you even start with foreign character support, especially if you don’t speak the language? If you receive a foreign-language document and get asked to put it on the web, this is about the point you start panicking.
Relax, it’s actually surprisingly easy, given a fairly modern Operating System with decent language support. Here’s what you need to know.
Operating System Support
You may not be able to see the document in its original character set, but depending on your OS, you might be able to copy and paste the characters between documents without damaging the data. I’ve had luck copying from Windows Notepad and pasting into my HTML editor.
Easiest way to tell is to try with a small amount of data — paste it into a properly-encoded document (see below), and view it in Mozilla or IE6. If it renders properly with the desired characters intact, you’re good to go.
If it doesn’t, you may not have the correct language pack installed — it should be possible to work with the data anyway (even if you can’t view it — just make sure you test on a system that can), but it can’t hurt to install any foreign language packs you can get your hands on, just in case. The 200MB of disk space is negligible in 2003.
UTF-16 files are out. Do not try saving your .html, .asp, or .php as a double-byte Unicode (UTF-16) file. Most modern browsers support it, but some older ones do not (IE5/Mac comes to mind). Not only that, but your file size roughly doubles for mostly-ASCII markup, and IIS and PHP alike have trouble with the files, so unless you’re serving up static HTML (not likely in 2003) you won’t be able to use them anyway.
Feel free to give a properly-encoded or UTF-8 document whatever file extension you wish, though. It can be .html, .php, .asp and so on.
It’s all about character encoding, baby. Redundancy is the key; define your XML namespace and language if working with XHTML, and also (regardless of whether you’re using HTML 4.01 or XHTML 1.x) add a <meta> tag to specify your document’s encoding.
The XML namespace and language declaration goes in your <html> tag, and looks like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
In this case, English is the language, designated by the "en". (complete list of the ISO 639 language codes)
<meta> Tag Encoding:
On top of setting your XML language, 9.8 times out of 10 you’ll also want to specify document encoding. I’m a little unclear on the difference between the two, but WaSP has a summary of the best way to encode a document. Syntax looks like this:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
charset (character set) is the key. For most western European languages based on Latin characters, you won’t need to change this; just include it. For eastern European, Asian, and all other languages, there are different charsets. Lists are available, but the best resource for this is in your Mozilla-based browser; hit View->Character Coding, and you should find a comprehensive list of all possibilities with their associated charset value. Use the code in brackets (US-ASCII etc.) and not the full name.
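As a rough sketch of how the two declarations pair up for a non-Latin language, here is what a small Greek page might look like. The el language code and the iso-8859-7 charset are assumptions chosen for illustration, the title and paragraph are placeholder text, and the file itself has to be saved in that same encoding for the characters to survive.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- xml:lang (and lang) name the human language: el = Greek -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="el" lang="el">
<head>
  <!-- charset names the byte encoding the file was saved in -->
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-7" />
  <title>Καλημέρα</title>
</head>
<body>
  <p>Καλημέρα κόσμε</p>
</body>
</html>
```

Swapping in another language should only mean changing the language code, the charset value, and the encoding you save the file in.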
Note that the WaSP article linked above has further information on server-side character encoding. This is beyond my current abilities, but is something highly recommended by the W3C. Worth a read, if you want to really do it properly.
Unicode character encoding works just fine, and in some cases is preferable. The difference here is that we’re not saving the document as a double-byte (UTF-16) Unicode file; we’re instead saving it as UTF-8 and setting the document’s charset to match through the meta tag. Sample Unicode encoding:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
As for individual characters, you may want to try using HTML Character Entities for occurrences of non-ASCII characters. That is, you might want to use &Uuml; instead of just the character itself, Ü. This can be tedious and trying though, and given proper encoding as discussed above, may even be unnecessary.
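For illustration, the same character can be written as a named entity, a decimal numeric reference, or a hexadecimal one. This is a small sketch, assuming the standard HTML 4 entity set; the sample words are placeholders.

```html
<!-- U-umlaut is code point 220 in decimal, DC in hexadecimal -->
<p>&Uuml;ber, &#220;ber, &#xDC;ber</p>

<!-- Greek small alpha: named entity and its numeric equivalent -->
<p>&alpha; and &#945;</p>
```

Because these references are written entirely in ASCII, they should come through intact even when the file is saved or served in the wrong encoding.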
One last thing to consider before we wrap up. WAI lists “identifying changes in language” as a priority 1 accessibility concern, which is to say, it’s Really Important that you do this. If your HTML switches at any point from the main language to another, you must provide some cue for the browser that this is happening. See the WAI for more on this.
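A minimal sketch of such a cue, assuming an English XHTML 1.0 page that quotes a French phrase: the lang attribute (paired with xml:lang in XHTML) marks the span where the language changes. The sentence is made up for the example.

```html
<p>
  As my grade-school teacher used to say,
  <span lang="fr" xml:lang="fr">c’est la vie</span>,
  and then we moved on.
</p>
```

The same attributes can go on any element, so a whole paragraph or block quote in another language can carry them directly instead of needing a wrapper span.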
This document was written by an embarrassingly unilingual English speaker with extremely limited foreign language capability beyond grade-school French classes. If I’ve managed to wrangle over a dozen translations of a document using these techniques, chances are they’re good enough for most cases. Inevitably I’ll have made some errors and over-simplified, but hey — that’s what the comments are for.
- On the goodness of Unicode (ongoing)
- Characters vs. Bytes (ongoing)
- Walking Backwards: Supporting Non-Western Languages on the Web (A List Apart)
- Specifying Character Encoding (WaSP)
- On Multilingual Web Sites and CSS (J. Korpela)