Q: Why does libiconv support encoding XXX? Why does libiconv not support
encoding ZZZ?
A: libiconv, as an internationalization library, supports those character
sets and encodings which are in widespread use in at least one territory
of the world.
Hint1: On http://www.w3c.org/ there is a page "Languages, countries, and
the charsets typically used for them" (with a minor mistake about Albanian).
From this table, we can conclude that the following are in active use:
ISO-8859-1, CP1252 Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
English, Faroese, Finnish, French, Galician, German,
Icelandic, Irish, Italian, Norwegian, Portuguese,
Scottish, Spanish, Swedish
ISO-8859-2 Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
Slovenian
ISO-8859-3 Esperanto, Maltese
ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
Serbian, Ukrainian
ISO-8859-6 Arabic
ISO-8859-7 Greek
ISO-8859-8 Hebrew
ISO-8859-9, CP1254 Turkish
ISO-8859-10 Estonian, Inuit, Lapp, Latvian, Lithuanian
KOI8-R Russian
SHIFT_JIS Japanese
ISO-2022-JP Japanese
EUC-JP Japanese
Ordered by frequency on the web (1997):
ISO-8859-1, CP1252 96%
SHIFT_JIS 1.6%
ISO-2022-JP 1.2%
EUC-JP 0.4%
CP1250 0.3%
CP1251 0.2%
CP850 0.1%
MACINTOSH 0.1%
ISO-8859-5 0.1%
ISO-8859-2 0.0%
Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
ISO-8859-1 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
English, Estonian, Faroese, Finnish, French,
Galician, German, Greenlandic, Icelandic,
Indonesian, Irish, Italian, Lithuanian, Norwegian,
Occitan, Portuguese, Scottish, Spanish, Swedish,
Walloon, Welsh
ISO-8859-2 Albanian, Croatian, Czech, Hungarian, Polish,
Romanian, Serbian, Slovak, Slovenian
ISO-8859-3 Esperanto
ISO-8859-4 Estonian, Latvian, Lithuanian
ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
Serbian, Ukrainian
ISO-8859-6 Arabic
ISO-8859-7 Greek
ISO-8859-8 Hebrew
ISO-8859-9 Turkish
ISO-8859-14 Breton, Irish, Scottish, Welsh
ISO-8859-15 Basque, Breton, Catalan, Danish, Dutch, Estonian,
Faroese, Finnish, French, Galician, German,
Greenlandic, Icelandic, Irish, Italian, Lithuanian,
Norwegian, Occitan, Portuguese, Scottish, Spanish,
Swedish, Walloon, Welsh
KOI8-R Russian
KOI8-U Russian, Ukrainian
EUC-JP (alias eucJP) Japanese
ISO-2022-JP (alias JIS7) Japanese
SHIFT_JIS (alias SJIS) Japanese
U90 Japanese
S90 Japanese
EUC-CN (alias eucCN) Chinese
EUC-TW (alias eucTW) Chinese
BIG5 Chinese
EUC-KR (alias eucKR) Korean
ARMSCII-8 Armenian
GEORGIAN-ACADEMY Georgian
GEORGIAN-PS Georgian
TIS-620 (alias TACTIS) Thai
MULELAO-1 Laothian
IBM-CP1133 Laothian
VISCII Vietnamese
TCVN Vietnamese
NUNACOM-8 Inuktitut
Hint3: The character sets supported by Netscape Communicator 4.
Where is this documented? For the complete picture, I had to use
"strings netscape" and then a lot of guesswork. For a quick take,
look at the "View - Character set" menu of Netscape Communicator 4.6:
ISO-8859-{1,2,5,7,9,15}
WINDOWS-{1250,1251,1253}
KOI8-R Cyrillic
CP866 Cyrillic
Autodetect Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
EUC-JP Japanese
SHIFT_JIS Japanese
GB2312 Chinese
BIG5 Chinese
EUC-TW Chinese
Autodetect Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
UTF-8
UTF-7
Hint4: The character sets supported by Microsoft Internet Explorer 4.
ISO-8859-{1,2,3,4,5,6,7,8,9}
WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
KOI8-R Cyrillic
KOI8-RU Ukrainian
ASMO-708 Arabic
EUC-JP Japanese
ISO-2022-JP Japanese
SHIFT_JIS Japanese
GB2312 Chinese
HZ-GB-2312 Chinese
BIG5 Chinese
EUC-KR Korean
ISO-2022-KR Korean
WINDOWS-874 Thai
WINDOWS-1258 Vietnamese
UTF-8
UTF-7
UNICODE actually UNICODE-LITTLE
UNICODEFEFF actually UNICODE-BIG
and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
We take the union of these four sets. The result is:
European and Semitic languages
* ASCII.
We implement this because it is occasionally useful to know or to
check whether some text is entirely ASCII (i.e. whether the conversion
ISO-8859-x -> UTF-8 is trivial); a small sketch of such a check follows
at the end of this group.
* ISO-8859-{1,2,3,4,5,6,7,8,9,10}
We implement these because they are widely used, except ISO-8859-4,
which appears to have been superseded by ISO-8859-10 in the Baltic
countries; but it is an ISO standard anyway.
* ISO-8859-{13,14}
We implement these because they are ISO standards.
* ISO-8859-15
We implement this because it is increasingly used in Europe, owing to
the Euro symbol.
* KOI8-R, KOI8-U
We implement these because they appear to be the predominant encodings
on Unix in Russia and Ukraine, respectively.
* KOI8-RU
We implement this because MSIE4 supports it.
* CP{1250,1251,1252,1253,1254,1255,1256,1257}
We implement these because they are the predominant Windows encodings
in Europe.
* CP850
We implement this because it is mentioned as occurring on the web
in the aforementioned statistics.
* CP866
We implement this because Netscape Communicator does.
* Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
Mac{Hebrew,Arabic}
We implement these because the Sun JDK does, and because Mac users
don't deserve to be punished.
* Macintosh
We implement this because it is mentioned as occurring on the web
in the aforementioned statistics.
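Regarding the ASCII entry above: a minimal sketch of such an "is this text
pure ASCII?" check, not part of libiconv itself. It assumes the POSIX
<iconv.h> interface; the function name and buffer size are made up for
illustration. The idea is that the ASCII converter reports an invalid
sequence (EILSEQ) as soon as it sees a byte >= 0x80.

  #include <iconv.h>
  #include <errno.h>
  #include <stddef.h>

  /* Returns 1 if buf[0..len-1] is entirely ASCII, 0 if not, -1 on error. */
  int is_all_ascii (const char *buf, size_t len)
  {
    iconv_t cd = iconv_open ("ASCII", "ASCII");
    char tmp[256];
    char *inptr = (char *) buf;   /* some iconv() declarations want non-const */
    size_t inleft = len;
    int result = 1;
    if (cd == (iconv_t)(-1))
      return -1;
    while (inleft > 0)
      {
        char *outptr = tmp;
        size_t outleft = sizeof (tmp);
        if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1))
          {
            if (errno == E2BIG)       /* output buffer full, just refill it */
              continue;
            result = 0;               /* EILSEQ: a non-ASCII byte was found */
            break;
          }
      }
    iconv_close (cd);
    return result;
  }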
Japanese
* EUC-JP, SHIFT_JIS, ISO-2022-JP
We implement these because they are widely used. EUC-JP and SHIFT_JIS
are used more for files, whereas ISO-2022-JP is recommended for email.
* ISO-2022-JP-2
We implement this because it's the common way to represent mail messages
which make use of JIS X 0212 characters.
* ISO-2022-JP-1
We implement this because it's in the RFCs, but I don't think it is
really used.
* U90, S90
We DON'T implement these because I have no information about what they
are or who uses them.
Simplified Chinese
* EUC-CN = GB2312
We implement this because it is the widely used representation
of simplified Chinese.
* GBK
We implement this because it appears to be used on Solaris and Windows.
* ISO-2022-CN
We implement this because it is in the RFCs, but I have no idea
whether it is really used.
* ISO-2022-CN-EXT
We implement this because it's in the RFCs, but I don't think it is
really used.
* HZ = HZ-GB-2312
We implement this because the RFCs recommend it for Usenet postings,
and because MSIE4 supports it.
Traditional Chinese
* EUC-TW
We implement it because it appears to be used on Unix.
* BIG5
We implement it because it is the de facto standard for traditional
Chinese.
* CP950
We implement this because it is the Microsoft variant of BIG5, used
on Windows.
* BIG5+
We DON'T implement this because it doesn't appear to be in wide use.
Only the CWEX fonts use this encoding. Furthermore, the conversion
tables in the big5p package are not consistent: converting directly
gives different results than converting via GBK.
Korean
* EUC-KR, JOHAB
We implement these because they appear to be the widely used
representations for Korean.
* ISO-2022-KR
We implement it because it is in the RFCs and because MSIE4 supports
it, but I have no idea whether it's really used.
Armenian
* ARMSCII-8
We implement it because XFree86 supports it.
Georgian
* Georgian-Academy, Georgian-PS
We implement these because both appear to be used for Georgian;
XFree86 supports them.
Thai
* TIS-620
We implement this because it seems to be the standard for Thai.
* CP874
We implement this because MSIE4 supports it.
* MacThai
We implement this because the Sun JDK does, and because Mac users
don't deserve to be punished.
Laotian
* MuleLao-1, CP1133
We implement these because XFree86 supports them. I have no idea which
one is used more widely.
Vietnamese
* VISCII, TCVN
We implement these because XFree86 supports them.
* CP1258
We implement this because MSIE4 supports it.
Other languages
* NUNACOM-8 (Inuktitut)
We DON'T implement this because it isn't part of Unicode yet, and
therefore doesn't convert to anything except itself.
Platform specifics
* HP-ROMAN8, NEXTSTEP
We implement these because they were the native character sets on HP
and NeXT machines for a long time, and libiconv is intended to be usable
on these old machines.
Full Unicode
* UTF-8, UCS-2, UCS-4
We implement these. Obviously.
* UTF-16
We implement this, because it is still the favourite encoding of the
president of the Unicode Consortium (for political reasons).
* UTF-7
We implement this because it is essential functionality for mail
applications.
* JAVA
We implement it because it's used for Java programs and because it's
a nice encoding for debugging.
* UNICODE (little endian), UNICODEFEFF (big endian)
We DON'T implement these because they are stupid and not standardized.
Full Unicode, in terms of `uint16_t' or `uint32_t'
(with machine dependent endianness and alignment)
* UCS-2-INTERNAL, UCS-4-INTERNAL
We implement these because they are the preferred internal
representation of strings in Unicode aware applications.
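As a small illustration of using UCS-4-INTERNAL as such an internal
representation, here is a sketch (not taken from the libiconv sources; the
sample string and buffer size are arbitrary) that converts a UTF-8 string
into an array of uint32_t values in machine byte order:

  #include <iconv.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  int main (void)
  {
    const char *utf8 = "Gr\303\274\303\237e";   /* "Gruesse" with u-umlaut and sharp s */
    uint32_t ucs[16];                            /* properly aligned output buffer */
    iconv_t cd = iconv_open ("UCS-4-INTERNAL", "UTF-8");
    char *inptr = (char *) utf8;
    char *outptr = (char *) ucs;
    size_t inleft = strlen (utf8);
    size_t outleft = sizeof (ucs);
    size_t i, n;
    if (cd == (iconv_t)(-1))
      return 1;
    if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1))
      return 1;
    n = (sizeof (ucs) - outleft) / sizeof (uint32_t);   /* number of code points */
    for (i = 0; i < n; i++)
      printf ("U+%04X\n", (unsigned int) ucs[i]);
    iconv_close (cd);
    return 0;
  }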
Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
since ISO-8859-* have been adopted.
Q: Support EBCDIC?
A: No!
Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
2. You need to have a conversion table from/to Unicode. Transform it into
the format used by the mapping tables found on ftp.unicode.org: each line
contains the character code, in hex, with 0x prefix, then whitespace,
then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
counts as a comment delimiter until end of line. (A short sample of this
format is shown after these steps.)
Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
can include it in his collection.
3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
tools directory to generate the C code for the conversion. You may tweak
the resulting C code if you are not satisfied with its quality, but this is
rarely needed.
If it's a two-dimensional character set (with rows and columns), use the
'cjk_tab_to_h' program in the tools directory to generate the C code for
the conversion. You will need to modify the main() function to recognize
the new character set name, with the proper dimensions, but that shouldn't
be too hard. This yields the CCS (coded character set). The CES (character
encoding scheme) you have to write by hand.
4. Store the resulting C code file in the src directory. Add a #include
directive to converters.h, and add an entry to the encodings.def file.
5. Compile the package, and test your new encoding using a program like
iconv(1) or clisp(1).
6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
encoding, create the complete table as a TXT file. For a stateful encoding,
provide a text snippet encoded using your new encoding and its UTF-8
equivalent.
7. Update the README and man3/iconv_open.3, to mention the new encoding.
Add a note in the NEWS file.
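For step 2, here is a short sample of the expected table format; the lines
shown are taken from the ISO-8859-2 mapping and are only illustrative:

  0x41    0x0041    # LATIN CAPITAL LETTER A
  0xA1    0x0104    # LATIN CAPITAL LETTER A WITH OGONEK
  0xA3    0x0141    # LATIN CAPITAL LETTER L WITH STROKE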
Q: What about bidirectional text? Should it be tagged or reversed when
converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
ISO-8859-8-E remains to be implemented.
On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
is visually encoded, whereas ISO-8859-8 in *HTML* is logically encoded,
i.e. the same as ISO-8859-8-I. I'm confused.
Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"
Other aliases not implemented (and not implemented in glibc-2.1 either):
From MSIE4:
ISO-8859-1: alias ISO8859-1
ISO-8859-2: alias ISO8859-2
KSC_5601: alias KS_C_5601
UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
At configuration time, call libiconv's configure script with the
appropriate --srcdir option and maybe --enable-static or --disable-shared.
Then "cd libiconv && make && make install-lib libdir=... includedir=...".
'install-lib' is a special (not GNU standardized) target which installs
only the include file (in $(includedir)) and the library (in $(libdir)),
and does not use other directory variables. After "installing" libiconv
in your package's build directory, the build of your package can proceed.
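As a quick sanity check that the in-tree installation is usable from the
rest of your package, a tiny test program along these lines (the sample
charsets are just an illustration) should compile with -I$(includedir)
and link with -L$(libdir) -liconv:

  #include <stdio.h>
  #include <iconv.h>   /* the header installed by "make install-lib" */

  int main (void)
  {
    /* If this links and runs, the installed library is usable. */
    iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)(-1))
      {
        perror ("iconv_open");
        return 1;
      }
    iconv_close (cd);
    printf ("libiconv is usable\n");
    return 0;
  }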
Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
If you don't feel like using the testsuite, you can simply remove the
tests/ directory.