| # Character Set Detection |
| |
| ## Overview |
| |
| Character set detection is the process of determining the character set, or |
| encoding, of character data in an unknown format. This is, at best, an imprecise |
| operation using statistics and heuristics. Because of this, detection works best |
| if you supply at least a few hundred bytes of character data that's mostly in a |
| single language. In some cases, the language can be determined along with the |
| encoding. |
| |
| Several different techniques are used for character set detection. For |
| multi-byte encodings, the sequence of bytes is checked for legal patterns. The |
| detected characters are also checked against a list of frequently used characters |
| in that encoding. For single-byte encodings, the data is checked against a list |
| of the most commonly occurring three-letter groups for each language that can be |
| written using that encoding. The detection process can be configured to |
| optionally ignore HTML- or XML-style markup, which can interfere with the |
| detection process by changing the statistics. |
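To make the "legal patterns" idea concrete, here is a minimal sketch of a byte-pattern check for one multi-byte encoding, UTF-8. It is an illustration only and not ICU's implementation: the class `Utf8PatternCheck` and the method `countSequences` are hypothetical names, and the check is simplified (it does not reject overlong forms or encoded surrogates).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of a legal-byte-pattern check; not part of ICU. */
class Utf8PatternCheck {
    /**
     * Returns {valid, invalid}: the number of well-formed multi-byte
     * sequences and the number of malformed ones. ASCII bytes are skipped
     * because they are legal in many encodings and carry no evidence.
     */
    static int[] countSequences(byte[] data) {
        int valid = 0;
        int invalid = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            i++;
            int trail; // number of trail bytes this lead byte requires
            if (b < 0x80) {
                continue;
            } else if (b >= 0xC2 && b <= 0xDF) {
                trail = 1;
            } else if (b >= 0xE0 && b <= 0xEF) {
                trail = 2;
            } else if (b >= 0xF0 && b <= 0xF4) {
                trail = 3;
            } else {
                invalid++; // stray trail byte or illegal lead byte
                continue;
            }
            int seen = 0;
            // Trail bytes must have the bit pattern 10xxxxxx.
            while (seen < trail && i < data.length && (data[i] & 0xC0) == 0x80) {
                i++;
                seen++;
            }
            if (seen == trail) {
                valid++;
            } else {
                invalid++; // sequence cut short or interrupted
            }
        }
        return new int[] {valid, invalid};
    }

    public static void main(String[] args) {
        byte[] sample = "na\u00efve caf\u00e9".getBytes(StandardCharsets.UTF_8);
        int[] counts = countSequences(sample);
        System.out.println("valid=" + counts[0] + " invalid=" + counts[1]);
    }
}
```

A detector built on such a check might treat many valid sequences and no invalid ones as strong evidence for UTF-8, and any invalid sequence as strong evidence against it; how the counts are turned into a confidence value is a design choice this sketch leaves open.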
| |
| The input data can either be a Java input stream, or an array of bytes. The |
| output of the detection process is a list of possible character sets, with the |
| most likely one first. For simplicity, you can also ask for a Java Reader that |
| will read the data in the detected encoding. |
| |
| There is another character set detection C++ library, the [Compact Encoding |
| Detector](https://github.com/google/compact_enc_det), that may have a lower |
| error rate, particularly when working with short samples of text. |
| |
| ## CharsetMatch |
| |
| The CharsetMatch class holds the result of comparing the input data to a |
| particular encoding. You can use an instance of this class to get the name of |
| the character set, the language, and how good the match is. You can also use |
| this class to decode the input data. |
| |
| To find out how good the match is, you use the getConfidence() method to get a |
| *confidence value*. This is an integer from 0 to 100. The higher the value, the |
| more confidence there is in the match. For example: |
| |
| CharsetMatch match = ...; |
| int confidence; |
| confidence = match.getConfidence(); |
| if (confidence < 50 ) { |
| // handle a poor match... |
| } else { |
| // handle a good match... |
| } |
| |
| In C, you can use the ucsdet_getConfidence(const UCharsetMatch \*ucsm, UErrorCode |
| \*status) function to get a confidence value: |
| |
| const UCharsetMatch *ucm = ...; |
| UErrorCode status = U_ZERO_ERROR; |
| int32_t confidence = ucsdet_getConfidence(ucm, &status); |
| if (confidence < 50) { |
| // handle a poor match... |
| } else { |
| // handle a good match... |
| } |
| |
| To get the name of the character set, which can be used as an encoding name in |
| Java, you use the getName() method: |
| |
| CharsetMatch match = ...; |
| byte[] characterData = ...; |
| String charsetName; |
| String unicodeData; |
| charsetName = match.getName(); |
| unicodeData = new String(characterData, charsetName); |
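The String constructor shown above throws an UnsupportedEncodingException if the running JVM has no converter for the given name, and a Java runtime is only guaranteed to support a handful of standard charsets. The following sketch, which uses only JDK classes, checks for support first and falls back to a caller-supplied charset. The class `DecodeWithName` and its `decode` method are hypothetical helpers, not part of ICU.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of ICU. */
class DecodeWithName {
    /**
     * Decodes data using charsetName when this JVM supports it,
     * and using the fallback charset otherwise.
     */
    static String decode(byte[] data, String charsetName, Charset fallback) {
        Charset charset = fallback;
        try {
            if (charsetName != null && Charset.isSupported(charsetName)) {
                charset = Charset.forName(charsetName);
            }
        } catch (IllegalArgumentException e) {
            // The name is not a syntactically legal charset name: keep the fallback.
        }
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] characterData = {'c', 'a', 'f', (byte) 0xE9};
        // In real use the name would come from CharsetMatch.getName().
        System.out.println(decode(characterData, "ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

Whether a silent fallback is appropriate depends on the application; reporting the unsupported name to the caller is an equally reasonable choice.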
| |
| To get the name of the character set in C: |
| |
| const UCharsetMatch *ucm = ...; |
| UErrorCode status = U_ZERO_ERROR; |
| const char *name = ucsdet_getName(ucm, &status); |
| |
| To get the three-letter ISO code for the detected language, you use the |
| getLanguage() method. If the language could not be determined, getLanguage() |
| will return null. Note that language detection does not work with all charsets, |
| and includes only a very small set of possible languages. It should not be used if |
| robust, reliable language detection is required. |
| |
| CharsetMatch match = ...; |
| String languageCode; |
| languageCode = match.getLanguage(); |
| if (languageCode != null) { |
| // handle the language code... |
| } |
| |
| The ucsdet_getLanguage(const UCharsetMatch \*ucsm, UErrorCode \*status) function |
| can be used in C to get the language code. If the language could not be |
| determined, the function will return an empty string. |
| |
| const UCharsetMatch *ucm = ...; |
| UErrorCode status = U_ZERO_ERROR; |
| const char *language = ucsdet_getLanguage(ucm, &status); |
| |
| If you want to get a Java String containing the converted data you can use the |
| getString() method: |
| |
| CharsetMatch match = ...; |
| String unicodeData; |
| unicodeData = match.getString(); |
| |
| If you want to limit the number of characters in the string, pass the maximum |
| number of characters you want to the getString() method: |
| |
| CharsetMatch match = ...; |
| String unicodeData; |
| unicodeData = match.getString(1024); |
| |
| To get a java.io.Reader to read the converted data, use the getReader() method: |
| |
| CharsetMatch match = ...; |
| Reader reader; |
| StringBuilder sb = new StringBuilder(); |
| char[] buffer = new char[1024]; |
| int charsRead; |
| reader = match.getReader(); |
| // read() returns the number of chars read, or -1 at the end of the stream |
| while ((charsRead = reader.read(buffer, 0, buffer.length)) >= 0) { |
| sb.append(buffer, 0, charsRead); |
| } |
| reader.close(); |
| |
| ## CharsetDetector |
| |
| The CharsetDetector class does the actual detection. It matches the input data |
| against all character sets, and computes a list of CharsetMatch objects to hold |
| the results. The input data can be supplied as an array of bytes, or as a |
| java.io.InputStream. |
| |
| To use a CharsetDetector object, first you construct it, and then you set the |
| input data, using the setText() method. Because setting the input data is |
| separate from the construction, it is easy to reuse a CharsetDetector object: |
| |
| CharsetDetector detector; |
| byte[] byteData = ...; |
| InputStream streamData = ...; |
| detector = new CharsetDetector(); |
| detector.setText(byteData); |
| // use detector with byte data... |
| detector.setText(streamData); |
| // use detector with stream data... |
| |
| If you want to know which character set matches your input data with the highest |
| confidence, you can use the detect() method, which will return a CharsetMatch |
| object for the match with the highest confidence: |
| |
| CharsetDetector detector; |
| CharsetMatch match; |
| byte[] byteData = ...; |
| detector = new CharsetDetector(); |
| detector.setText(byteData); |
| match = detector.detect(); |
| |
| If you want to know which character set matches your input data in C, you can |
| use the ucsdet_detect(UCharsetDetector \*csd, UErrorCode \*status) function: |
| |
| UErrorCode status = U_ZERO_ERROR; |
| UCharsetDetector *csd = ucsdet_open(&status); |
| const UCharsetMatch *ucm; |
| static char buffer[BUFFER_SIZE] = {....}; |
| int32_t inputLength = ...; // length of the input text |
| ucsdet_setText(csd, buffer, inputLength, &status); |
| ucm = ucsdet_detect(csd, &status); |
| |
| If you want to know all of the character sets that could match your input data |
| with a non-zero confidence, you can use the detectAll() method, which will |
| return an array of CharsetMatch objects sorted by confidence, from highest to |
| lowest: |
| |
| CharsetDetector detector; |
| CharsetMatch[] matches; |
| byte[] byteData = ...; |
| detector = new CharsetDetector(); |
| detector.setText(byteData); |
| matches = detector.detectAll(); |
| for (int m = 0; m < matches.length; m += 1) { |
| // process this match... |
| } |
| |
| *Note: In C, the ucsdet_detectAll(UCharsetDetector \*csd, int32_t \*matchesFound, |
| UErrorCode \*status) function detects all of the matching character sets. |
| matchesFound is a pointer to a variable that will be set to the number of |
| charsets identified that are consistent with the input data.* |
| |
| The CharsetDetector class also implements a crude *input filter* that can strip |
| out HTML- and XML-style tags. If you want to enable the input filter, which is |
| disabled when you construct a CharsetDetector, you use the enableInputFilter() |
| method, which takes a boolean. Pass in true if you want to enable the input |
| filter, and false if you want to disable it: |
| |
| CharsetDetector detector; |
| CharsetMatch match; |
| byte[] byteDataWithTags = ...; |
| detector = new CharsetDetector(); |
| detector.setText(byteDataWithTags); |
| detector.enableInputFilter(true); |
| match = detector.detect(); |
| |
| To enable the input filter in C, you can use the ucsdet_enableInputFilter( |
| UCharsetDetector\* csd, UBool filter) function: |
| |
| UErrorCode status = U_ZERO_ERROR; |
| UCharsetDetector *csd = ucsdet_open(&status); |
| const UCharsetMatch *ucm; |
| static char buffer[BUFFER_SIZE] = {....}; |
| int32_t inputLength = ...; // length of the input text |
| ucsdet_setText(csd, buffer, inputLength, &status); |
| ucsdet_enableInputFilter(csd, TRUE); |
| ucm = ucsdet_detect(csd, &status); |
| |
| If you have more detailed knowledge about the structure of the input data, it is |
| better to filter the data yourself before you pass it to CharsetDetector. For |
| example, you might know that the data is from an HTML page that contains CSS |
| styles, which will not be stripped by the input filter. |
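As one way to pre-filter the CSS case mentioned above, the sketch below removes `<style>` blocks from the raw bytes before detection. It is an assumption-laden illustration, not an ICU facility: the class `StylePreFilter` and the method `stripStyleBlocks` are hypothetical, and the approach assumes the markup itself is written in ASCII-compatible bytes, so it will not work for UTF-16 or UTF-32 input. The bytes are viewed through ISO-8859-1 only because that charset maps every byte value to exactly one char and back, so the bytes outside the removed blocks are unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical pre-filter; not part of ICU. */
class StylePreFilter {
    // Matches a <style ...> element and everything up to its closing tag.
    private static final Pattern STYLE_BLOCK = Pattern.compile(
            "<style\\b[^>]*>.*?</style\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns a copy of data with each style block replaced by one space. */
    static byte[] stripStyleBlocks(byte[] data) {
        // ISO-8859-1 is used as a lossless byte-to-char view, not as a guess
        // about the real encoding of the data.
        String view = new String(data, StandardCharsets.ISO_8859_1);
        String stripped = STYLE_BLOCK.matcher(view).replaceAll(" ");
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] page = "<p>a</p><style>p { color: red; }</style><p>b</p>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] filtered = stripStyleBlocks(page);
        System.out.println(new String(filtered, StandardCharsets.US_ASCII));
    }
}
```

The filtered array can then be passed to setText(), with the built-in input filter enabled to handle the remaining tags.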
| |
| You can use the inputFilterEnabled() method to see if the input filter is |
| enabled: |
| |
| CharsetDetector detector; |
| detector = new CharsetDetector(); |
| // do a bunch of stuff with detector |
| // which may or may not enable the input filter... |
| if (detector.inputFilterEnabled()) { |
| // handle enabled input filter |
| } else { |
| // handle disabled input filter |
| } |
| |
| *Note: The ICU4C API provides the ucsdet_isInputFilterEnabled(const UCharsetDetector\* |
| csd) function to check whether the input filter is enabled.* |
| |
| The CharsetDetector class also has two convenience methods that let you detect |
| and convert the input data in one step: the getReader() and getString() methods: |
| |
| CharsetDetector detector; |
| byte[] byteData = ...; |
| InputStream streamData = ...; |
| String unicodeData; |
| Reader unicodeReader; |
| detector = new CharsetDetector(); |
| unicodeData = detector.getString(byteData, null); |
| unicodeReader = detector.getReader(streamData, null); |
| |
| *Note: The second argument to the getReader() and getString() methods is a |
| String called declaredEncoding, which is not currently used. There is also a |
| setDeclaredEncoding() method, which is also not currently used.* |
| |
| The following code is equivalent to using the convenience methods: |
| |
| CharsetDetector detector; |
| CharsetMatch match; |
| byte[] byteData = ...; |
| InputStream streamData = ...; |
| String unicodeData; |
| Reader unicodeReader; |
| detector = new CharsetDetector(); |
| detector.setText(byteData); |
| match = detector.detect(); |
| unicodeData = match.getString(); |
| detector.setText(streamData); |
| match = detector.detect(); |
| unicodeReader = match.getReader(); |
| |
| ## Detected Encodings |
| |
| The following table shows all the encodings that can be detected. You can get |
| this list (without the languages) by calling the getAllDetectableCharsets() |
| method: |
| |
| Character Set | Languages |
| --- | --- |
| UTF-8 | |
| UTF-16BE | |
| UTF-16LE | |
| UTF-32BE | |
| UTF-32LE | |
| Shift_JIS | Japanese |
| ISO-2022-JP | Japanese |
| ISO-2022-CN | Simplified Chinese |
| ISO-2022-KR | Korean |
| GB18030 | Chinese |
| Big5 | Traditional Chinese |
| EUC-JP | Japanese |
| EUC-KR | Korean |
| ISO-8859-1 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| ISO-8859-2 | Czech, Hungarian, Polish, Romanian |
| ISO-8859-5 | Russian |
| ISO-8859-6 | Arabic |
| ISO-8859-7 | Greek |
| ISO-8859-8 | Hebrew |
| ISO-8859-9 | Turkish |
| windows-1250 | Czech, Hungarian, Polish, Romanian |
| windows-1251 | Russian |
| windows-1252 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| windows-1253 | Greek |
| windows-1254 | Turkish |
| windows-1255 | Hebrew |
| windows-1256 | Arabic |
| KOI8-R | Russian |
| IBM420 | Arabic |
| IBM424 | Hebrew |