| --- |
| layout: default |
| title: ICU Data |
| nav_order: 13 |
| has_children: true |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # ICU Data |
| {: .no_toc } |
| |
| ## Contents |
| {: .no_toc .text-delta } |
| |
| 1. TOC |
| {:toc} |
| |
| --- |
| |
| ## Overview |
| |
| ICU makes use of a wide variety of data tables to provide many of its services. |
| Examples include converter mapping tables, collation rules, transliteration |
| rules, break iterator rules and dictionaries, and other locale data. Additional |
| data can be provided by users, either as customizations of ICU's data or as new |
| data altogether. |
| |
| This section describes how ICU data is stored and located at run time. It also |
| describes how ICU data can be customized to suit the needs of a particular |
| application. |
| |
| For simple use of ICU's predefined data, this section on data management can |
| safely be skipped. The data is built into a library that is loaded along with |
| the rest of ICU. No specific action or setup is required of either the |
| application program or the execution environment. |
| |
| Update: as of ICU 64, the standard data library is over 20 MB in size. We have |
| introduced a new tool, the [ICU Data Build Tool](./icu_data/buildtool.md), |
| to give you more control over what goes into your ICU locale data file. |
| |
| > :point_right: **Note**: ICU for C by default comes with pre-built data. |
| > The source data files are included as an "icu\*data.zip" file starting in ICU4C 49. |
| > Previously, they were not included unless ICU was downloaded from the [source repository](http://site.icu-project.org/repository). |
| |
| ## ICU and CLDR Data |
| |
| Most of ICU's data is sourced from [CLDR](http://cldr.unicode.org), the [Common |
| Locale Data Repository](http://cldr.unicode.org) project. Do not file bugs |
| against ICU to request data changes in CLDR; report them to the CLDR project |
| instead. Because most ICU data files are autogenerated from CLDR, manually |
| editing them is not usually recommended. |
| |
| Data which is NOT sourced from CLDR includes: |
| |
| * [Conversion Data](conversion/data.md) |
| * Break Iterator Dictionary Data (Thai, CJK, etc.) |
| * Break Iterator Rule Data (as of this writing, it is manually kept in sync |
| with the CLDR datasets) |
| |
| For information on building ICU data from CLDR, see the |
| [cldr-icu-readme](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/cldr-icu-readme.txt). |
| |
| ## ICU Data Directory |
| |
| The ICU data directory is the default location for all ICU data. Any requests |
| for data items that do not include an explicit directory path will be resolved |
| to files located in the ICU data directory. |
| |
| The ICU data directory is determined as follows: |
| |
| 1. If the application has called the function `u_setDataDirectory()`, use the |
| directory specified there, otherwise: |
| |
| 2. If the environment variable `ICU_DATA` is set, use that, otherwise: |
| |
| 3. If the C preprocessor variable `ICU_DATA_DIR` was set at the time ICU was |
| built, use its compiled-in value. |
| |
| 4. Otherwise, the ICU data directory is an empty string. This is the default |
| behavior for ICU using a shared library for its data and provides the |
| highest data loading performance. |
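
The lookup order above can be modeled as a "first configured value wins" chain. The following is a minimal sketch in plain C, not ICU's actual implementation. The function `resolve_data_directory` and its parameters are hypothetical names that stand in for the value passed to `u_setDataDirectory()`, the `ICU_DATA` environment variable, and the compiled-in `ICU_DATA_DIR`.

```c
#include <stddef.h>

/*
 * Hypothetical model of the lookup order described above.
 * Each argument is NULL when the corresponding source was not configured.
 */
static const char *resolve_data_directory(const char *explicit_dir,
                                          const char *env_icu_data,
                                          const char *compiled_in_dir) {
    if (explicit_dir != NULL) {
        return explicit_dir;      /* 1. value given to u_setDataDirectory() */
    }
    if (env_icu_data != NULL) {
        return env_icu_data;      /* 2. ICU_DATA environment variable */
    }
    if (compiled_in_dir != NULL) {
        return compiled_in_dir;   /* 3. ICU_DATA_DIR set when ICU was built */
    }
    return "";                    /* 4. empty string: no file system lookup */
}
```

In this model, a caller would pass the result of `getenv("ICU_DATA")` as the second argument. How ICU treats a variable that is set but empty is not covered by this sketch.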
| |
| > :point_right: **Note**: `u_setDataDirectory()` is not thread-safe. Call it |
| > *before* calling ICU APIs from multiple threads. If you use both |
| > `u_setDataDirectory()` and `u_init()`, then use `u_setDataDirectory()` first. |
| > |
| > *Earlier versions of ICU supported two additional schemes: setting a data |
| > directory relative to the location of the ICU shared libraries, and on Windows, |
| > taking a location from the registry. These have both been removed to make the |
| > behavior more predictable and easier to understand.* |
| |
| The ICU data directory does not need to be set in order to reference the |
| standard built-in ICU data. Applications that just use standard ICU capabilities |
| (converters, locales, collation, etc.) but do not build and reference their own |
| data do not need to specify an ICU data directory. |
| |
| ### Multiple-Item ICU Data Directory Values |
| |
| The ICU data directory string can contain multiple directories as well as .dat |
| path/filenames. They must be separated by the path separator that is used on the |
| platform, for example a semicolon (`;`) on Windows. Data files will be searched in |
| all directories and .dat package files in the order of the directory string. For |
| details, see the example below. |
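
To make the multiple-item format concrete, here is a small sketch in plain C that walks a data directory string in search order. It is an illustration only, not ICU's parser: the names `visit_data_directory_entries`, `EntryKind`, and `count_dat_files` are hypothetical, and classifying an entry as a package file purely by its `.dat` suffix is a simplifying assumption. In ICU itself, the platform separator is available as `U_PATH_SEP_CHAR` from `putil.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Kind of entry found in a data directory string. */
typedef enum { ENTRY_DIRECTORY, ENTRY_DAT_FILE } EntryKind;

/* Callback invoked once per non-empty entry, in search order. */
typedef void (*EntryVisitor)(const char *entry, size_t length,
                             EntryKind kind, void *context);

/*
 * Splits `dirs` at each `separator` (';' on Windows, ':' on most
 * Unix-like systems) and reports the entries in order.
 * Returns the number of non-empty entries.
 */
static int visit_data_directory_entries(const char *dirs, char separator,
                                        EntryVisitor visit, void *context) {
    int count = 0;
    const char *start = dirs;
    assert(dirs != NULL && separator != '\0');
    for (;;) {
        const char *end = strchr(start, separator);
        size_t length = end ? (size_t)(end - start) : strlen(start);
        if (length > 0) {
            /* Assumption: an entry ending in ".dat" names a package file. */
            EntryKind kind = (length >= 4 &&
                              strncmp(start + length - 4, ".dat", 4) == 0)
                                 ? ENTRY_DAT_FILE : ENTRY_DIRECTORY;
            if (visit != NULL) {
                visit(start, length, kind, context);
            }
            ++count;
        }
        if (end == NULL) {
            break;
        }
        start = end + 1;
    }
    return count;
}

/* Example visitor: counts the .dat package entries. */
static void count_dat_files(const char *entry, size_t length,
                            EntryKind kind, void *context) {
    (void)entry;
    (void)length;
    if (kind == ENTRY_DAT_FILE) {
        ++*(int *)context;
    }
}
```

For a string such as `/usr/share/icu:/opt/app/app.dat` with `:` as the separator, the visitor sees the directory first and the package file second, which mirrors the search order described above.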
| |
| ## Default ICU Data |
| |
| The default ICU data consists of the data needed for the converters, collators, |
| locales, etc. that are provided with ICU. Default data must be present in order |
| for ICU to function. |
| |
| The default data is most commonly built into a shared library that is installed |
| with the other ICU libraries. Nothing is required of the application for this |
| mechanism to work. ICU provides additional options for loading the default data |
| if more flexibility is required. |
| |
| Here are the steps followed by ICU to locate its default data. This procedure |
| happens only once per process, at the time an ICU data item is first requested. |
| |
| 1. If the application has called the function `udata_setCommonData()`, use the |
| data that was provided. The application specifies the address in memory of |
| an image of an ICU common format data file (either in shared-library format |
| or .dat package file format). |
| |
| 2. Examine the contents of the default ICU data shared library. If it contains |
| data, use that data. If the data library is empty (that is, a stub library), |
| proceed to the next step. (A data shared library must always be present in |
| order for ICU to successfully link and load. A stub data library is used |
| when the actual ICU common data is to be provided from another source.) |
| |
| 3. Dynamically load (memory map, typically) a common format (.dat) file |
| containing the default ICU data. Loading is described in the section |
| [How Data Loading Works](icudata#how-data-loading-works). The path to |
| the data is of the form "icudt\<version\>\<flag\>", where \<version\> is |
| the two-digit ICU version number, and \<flag\> is a letter indicating the |
| internal format of the file (see the |
| [Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms) |
| section). |
| |
| Once the default ICU data has been located, loading of individual data items |
| proceeds as described in the section |
| [How Data Loading Works](icudata#how-data-loading-works). |
| |
| ## Building and Linking against ICU data |
| |
| When using ICU's configure or runConfigureICU tool to build, several different |
| methods of packaging are available. |
| |
| > :point_right: **Note**: in all cases, you **must** link all ICU tools and |
| applications against a "data library": either a data library containing the ICU |
| data, or against the "stubdata" library located in icu/source/stubdata. For |
| example, even if ICU is built in "files" mode, you must still link against the |
| "stubdata" library or an undefined symbol error occurs. |
| |
| * `--with-data-packaging=library` |
| This mode builds a shared library (DLL or .so). This is the simplest mode to |
| use, and is the default. |
| To use: link your application against the common and data libraries. |
| This is the only directly supported behavior on Windows builds. |
| * `--with-data-packaging=static` |
| This option builds ICU data as a single (large) static library. This mode is |
| more complex to use. If you encounter errors, you may need to build ICU |
| multiple times. |
| * `--with-data-packaging=files` |
| With this option, ICU outputs separate individual files (.res, .cnv, etc) |
| which will be loaded at runtime. Read the rest of this document, especially |
| the sections that discuss the ICU directory path. |
| * `--with-data-packaging=archive` |
| With this option, ICU outputs a single "icudt__.dat" file containing ICU |
| data. Read the rest of this document, especially the sections that discuss |
| the ICU directory path. |
| |
| ## Time Zone Data |
| |
| Because time zone data requires frequent updates in response to countries |
| changing their transition dates for daylight saving time, ICU provides |
| additional options for loading time zone data from separate files, thus avoiding |
| the need to update a combined ICU data package. Further information is found |
| under [Time Zones](datetime/timezone/index.md). |
| |
| ## Application Data |
| |
| ICU-based applications can ship and use their own data for localized strings, |
| custom conversion tables, etc. Each data item file must have a package name as a |
| prefix, and this package name must match the basename of a .dat package file, if |
| one is used. The package name must be used in ICU APIs, for example in |
| `udata_setAppData()` (instead of `udata_setCommonData()` which is only used for |
| ICU's own data) and in the pathname argument of `ures_open()`. |
| |
| The only real difference to ICU's own data is that application data cannot be |
| simply loaded by specifying a NULL value for the path arguments of ICU APIs, and |
| application data will not be used by APIs that do not have path/package name |
| arguments at all. |
| |
| The most important APIs that allow application data to be used are for Resource |
| Bundles, which are most often used for localized strings and other data. There |
| are also functions like `ucnv_openPackage()` that allow application data to be |
| specified, and the `udata.h` API can be used to load any data with minimum |
| requirements on the binary format, and without ICU interpreting the contents of |
| the data. |
| |
| The `pkgdata` tool, which is used to package the data into various formats (e.g. |
| shared library), has an option (`--without-assembly` or `-w`) to not use |
| assembly code when building and packaging the application specific data into a |
| shared library. Building the data with assembly code, which is enabled by |
| default, is faster and more efficient; however, there are some platform |
| specific issues that may arise. The `--without-assembly` option may be |
| necessary on certain platforms (e.g. Linux) which have trouble properly loading |
| application data when it was built with assembly code and is packaged as a |
| shared library. |
| |
| ## Alignment |
| |
| ICU data is designed to be 16-aligned, with natural alignment of values inside |
| the data structure, so that the data is usable as is when memory-mapped. |
| ("16-aligned" means that the start address is a multiple of 16 bytes.) |
| |
| Memory-mapping (as well as memory allocation) provides at least 16-alignment on |
| modern platforms. Some CPUs require n-alignment of types of size n bytes (and |
| crash on unaligned reads), other CPUs usually operate faster on data that is |
| aligned properly. |
| |
| Some of the ICU code explicitly checks for proper alignment. |
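
As an illustration of what such a check can look like, the sketch below tests whether an address is a multiple of 16 bytes. It is a generic C idiom, not a copy of ICU's internal code, and `is_16_aligned` is a hypothetical helper name.

```c
#include <stdint.h>

/* Returns nonzero if `p` starts at an address that is a multiple of 16. */
static int is_16_aligned(const void *p) {
    /* The low four bits of a 16-aligned address are all zero. */
    return ((uintptr_t)p & 0xFu) == 0;
}
```

An application that loads a .dat image into its own buffer could run a check like this before handing the buffer to ICU.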
| |
| The `icupkg` tool places data items into the .dat file at start offsets that are |
| multiples of 16 bytes. |
| |
| When using `genccode` to directly write a .o/.obj file, or to write assembler |
| code, it specifies at least 16-alignment. When using `genccode` to write C code, |
| it prepends the data with a double value which should yield at least 8-alignment |
| on most platforms (usually `sizeof(double)=8`). |
| |
| ## Flexibility vs. Installation vs. Performance |
| |
| There are choices that affect ICU data loading and depend on application |
| requirements. |
| |
| ### Data in Shared Libraries/DLLs vs. .dat package files |
| |
| Building ICU data into shared libraries (`--with-data-packaging=library`) is the |
| most convenient packaging method because shared libraries (DLLs) are easily |
| found if they are in the same directory as the application libraries, or if they |
| are on the system library path. The application installer usually just copies |
| the ICU shared libraries in the same place. On the other hand, shared libraries |
| are not portable. |
| |
| Packaging data into .dat files (`--with-data-packaging=archive`) allows them to |
| be shared across platforms, but they must either be loaded by the application |
| and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be |
| in a known location that is included in the ICU data directory string. This |
| requires the application installer, or the application itself at runtime, to |
| locate the ICU and/or application data by setting the ICU data directory (see |
| the [ICU Data Directory](icudata#icu-data-directory) section above) or by |
| loading the data and providing it to one of the `udata_setXYZData()` functions. |
| |
| Unlike shared libraries, .dat package files can be taken apart into separate |
| data item files with the `decmn` ICU tool. This allows post-installation |
| modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be |
| used to reassemble the .dat package file. |
| |
| For more information about .dat package files see the section [Sharing ICU Data |
| Between Platforms](icudata#sharing-icu-data-between-platforms) below. |
| |
| ### Data Overriding vs. Loading Performance |
| |
| If the ICU data directory string is empty, then ICU will not attempt to load |
| data from the file system. It is then only possible to load data from the |
| linked-in shared library or via `udata_setCommonData()` and |
| `udata_setAppData()`. This is inflexible but provides the highest performance. |
| |
| If the ICU data directory string is not empty, then data items are searched in |
| all directories and matching .dat files mentioned before checking in |
| already-loaded package files. This allows overriding of packaged data items with |
| single files after installation but costs some time for filesystem accesses. |
| This is usually done only once per data item; see |
| [User Data Caching](icudata#user-data-caching) below. |
| |
| ### Single Data Files vs. Packages |
| |
| Single data files (`--with-data-packaging=files`) are easy to replace and can |
| override items inside data packages. However, it is usually desirable to reduce |
| the number of files during installation, and package files use less disk space |
| than many small files. |
| |
| ## How Data Loading Works |
| |
| ICU data items are referenced by three names - a path, a name and a type. The |
| following are some examples: |
| |
| path | name | type |
| -----------------------------|----------|------- |
| c:\\some\\path\\dataLibName | test | dat |
| no path | cnvalias | icu |
| no path | cp1252 | cnv |
| no path | en | res |
| no path | uprops | icu |
| |
| |
| Items with 'no path' specified are loaded from the default ICU data. |
| |
| Application data items include a path, and will be loaded from user data files, |
| not from the ICU default data. For application data, the path argument need not |
| contain an actual directory, but must contain the application data's package |
| name after the last directory separator character (or by itself if there is no |
| directory). If the path argument contains a directory, then it is logically |
| prepended to the ICU data directory string and searched first for data. The path |
| argument can contain at most one directory. (Path separators like semicolon (;) |
| are not handled here.) |
| |
| > :point_right: **Note**: The ICU data directory string itself may |
| contain multiple directories and path/filenames to .dat package files. See the |
| [ICU Data Directory](icudata#icu-data-directory) section. |
| |
| It is recommended to not include the directory in the path argument but to make |
| sure via setting the application data or the ICU data directory string that the |
| data can be located. This simplifies program maintenance and improves |
| robustness. |
| |
| See the API descriptions for the functions `udata_open()` and |
| `udata_openChoice()` for additional information on opening ICU data from within |
| an application. |
| |
| Data items can exist as individual files, or a number of them can be packaged |
| together in a single file for greater efficiency in loading and convenience of |
| distribution. The combined files are called Common Files. |
| |
| Based on the supplied path and name, ICU searches several possible locations |
| when opening data. To make things more concrete in the following descriptions, |
| the following values of path, name and type are used: |
| |
| ``` |
| path = "c:\\some\\path\\dataLibName" |
| name = "test" |
| type = "res" |
| ``` |
| |
| In this case, "dataLibName" is the "package name" part of the path argument, and |
| "c:\\some\\path\\" is the directory part of it. |
| |
| The search sequence for the data for "test.res" is as follows (the first |
| successful loading attempt wins): |
| |
| 1. Try to load the file "dataLibName_test.res" from c:\\some\\path\\. |
| |
| 2. Try to load the file "dataLibName_test.res" from each of the directories in |
| the ICU data directory string. |
| |
| 3. Try to locate the data package for the package name "dataLibName". |
| |
| 1. Try to locate the data package in the internal cache. |
| |
| 2. Try to load the package file "dataLibName.dat" from c:\\some\\path\\. |
| |
| 3. Try to load the package file "dataLibName.dat" from each of the directories |
| in the ICU data directory string. |
| |
| The first steps, loading the data item from an individual file, are omitted if |
| no directory is specified in either the path argument or the ICU data directory |
| string. |
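
The search sequence can be sketched as a function that lists, in order, the file names that would be tried. This is a simplified model in plain C, not ICU's loader. All names (`CandidateList`, `list_search_candidates`, and so on) are hypothetical, every entry of the data directory string is treated as a directory, and the internal cache lookup is skipped because it does not touch the file system.

```c
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 16
#define MAX_PATH_LEN 256

/* Ordered list of file names that the search would try. */
typedef struct {
    char paths[MAX_CANDIDATES][MAX_PATH_LEN];
    int count;
} CandidateList;

/* Appends "<dir><file>" to the list, adding `dir_sep` if `dir` lacks it. */
static void add_candidate(CandidateList *list, const char *dir, size_t dir_len,
                          char dir_sep, const char *file) {
    char sep[2] = { '\0', '\0' };
    if (list->count >= MAX_CANDIDATES || dir_len == 0) {
        return;  /* list full, or no directory given */
    }
    if (dir[dir_len - 1] != dir_sep) {
        sep[0] = dir_sep;
    }
    snprintf(list->paths[list->count], MAX_PATH_LEN, "%.*s%s%s",
             (int)dir_len, dir, sep, file);
    list->count++;
}

/* Adds one candidate per non-empty entry of the data directory string. */
static void add_from_each_dir(CandidateList *list, const char *dirs,
                              char path_sep, char dir_sep, const char *file) {
    const char *start = dirs;
    while (start != NULL && *start != '\0') {
        const char *end = strchr(start, path_sep);
        size_t len = end ? (size_t)(end - start) : strlen(start);
        add_candidate(list, start, len, dir_sep, file);
        start = (end != NULL && *end != '\0') ? end + 1 : NULL;
    }
}

/* Fills `out` with the candidate files in the order described above. */
static void list_search_candidates(const char *path_dir, const char *package,
                                   const char *name, const char *type,
                                   const char *data_dirs,
                                   char dir_sep, char path_sep,
                                   CandidateList *out) {
    char item[MAX_PATH_LEN];
    char pkg[MAX_PATH_LEN];
    size_t path_dir_len = path_dir ? strlen(path_dir) : 0;

    out->count = 0;
    snprintf(item, sizeof item, "%s_%s.%s", package, name, type);
    snprintf(pkg, sizeof pkg, "%s.dat", package);

    /* Steps 1 and 2: the individual item file. */
    add_candidate(out, path_dir, path_dir_len, dir_sep, item);
    add_from_each_dir(out, data_dirs, path_sep, dir_sep, item);

    /* Step 3.1 (internal package cache) is not a file and is omitted. */

    /* Steps 3.2 and 3.3: the package file. */
    add_candidate(out, path_dir, path_dir_len, dir_sep, pkg);
    add_from_each_dir(out, data_dirs, path_sep, dir_sep, pkg);
}
```

When neither the path argument nor the data directory string contains a directory, the list stays empty, which matches the rule that the individual-file steps are skipped in that case.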
| |
| Package files are loaded at most once and then cached. They are identified only |
| by their package name. Whenever a data item is requested from a package and that |
| package has been loaded before, then the cached package is used immediately |
| instead of searching through the filesystem. |
| |
| > :point_right: **Note**: ICU versions before 2.2 always searched data packages |
| before looking for individual files, which made it impossible to override |
| packaged data items. See the ICU 2.2 download page and the readme for more |
| information about the changes. |
| |
| ## User Data Caching |
| |
| Once loaded, data package files are cached, and stay loaded for the duration of |
| the process. Any requests for data items from an already loaded data package |
| file are routed directly to the cached data. No additional search for loadable |
| files is made. |
| |
| The user data cache is keyed by the base file name portion of the requested |
| path, with any directory portion stripped off and ignored. Using the previous |
| example, for the path name "c:\\some\\path\\dataLibName", the cache key is |
| "dataLibName". After this is cached, a subsequent request for "dataLibName", no |
| matter what directory path is specified, will resolve to the cached data. |
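
The key derivation described above amounts to taking everything after the last directory separator. A minimal sketch in plain C follows; `package_cache_key` is a hypothetical name and this is not ICU's own code.

```c
#include <string.h>

/*
 * Returns the part of `path` after the last `dir_sep`, or `path` itself
 * when it contains no separator. `dir_sep` must not be '\0'.
 */
static const char *package_cache_key(const char *path, char dir_sep) {
    const char *last = strrchr(path, dir_sep);
    return last ? last + 1 : path;
}
```

Two requests that differ only in their directory part yield the same key in this model, which is why the second request resolves to the package that was cached first.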
| |
| Data can be explicitly added to the cache of common format data by means of the |
| `udata_setAppData()` function. This function takes as input the path (name) and |
| a pointer to a memory image of a .dat file. The data is added to the cache, |
| causing any subsequent requests for data items from that file name to be routed |
| to the cache. |
| |
| Only data package files are cached. Separate data files that contain just a |
| single data item are not cached; for these, multiple requests to ICU to open the |
| data will result in multiple requests to the operating system to open the |
| underlying file. |
| |
| However, most ICU services (Resource Bundles, conversion, etc.) themselves cache |
| loaded data, so that data is usually loaded only once until the end of the |
| process (or until `u_cleanup()`, `ucnv_flushCache()`, or a similar function is called). |
| |
| There is no mechanism for removing or updating cached data files. |
| |
| ## Directory Separator Characters |
| |
| If a directory separator (generally '/' or '\\') is needed in a path parameter, |
| use the form that is native to the platform. The ICU header `"putil.h"` defines |
| `U_FILE_SEP_CHAR` appropriately for the platform. |
| |
| > :point_right: **Note**: On Windows, the directory separator must be '\\' for |
| any paths passed to ICU APIs. This is different from native Windows APIs, which |
| generally allow either '/' or '\\'. |
| |
| ## Sharing ICU Data Between Platforms |
| |
| ICU's default data is (at the time of this writing) about 8 MB in size. Because |
| it is normally built as a shared library, the file format is specific to each |
| platform (operating system). The data libraries cannot be shared between |
| platforms even though the actual data contents are identical. |
| |
| By distributing the default data in the form of common format .dat files rather |
| than as shared libraries, a single data file can be shared among multiple |
| platforms. This is beneficial if a single distribution of the application (a CD, |
| for example) includes binaries for many platforms, and the size requirements for |
| replicating the ICU data for each platform are a problem. |
| |
| ICU common format data files are not completely interchangeable between |
| platforms. The format depends on these properties of the platform: |
| |
| 1. Byte Ordering (little endian vs. big endian) |
| |
| 2. Base character set - ASCII or EBCDIC |
| |
| This means, for example, that ICU data files are interchangeable between Windows |
| and Linux on X86 (both are ASCII little endian), or between Macintosh and |
| Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC |
| and Solaris on X86 (different byte ordering). |
| |
| The single letter following the version number in the file name of the default |
| ICU data file encodes the properties of the file as follows: |
| |
| ``` |
| icudt19l.dat Little Endian, ASCII |
| icudt19b.dat Big Endian, ASCII |
| icudt19e.dat Big Endian, EBCDIC |
| ``` |
| |
| (There are no little endian EBCDIC systems. All non-EBCDIC encodings include an |
| invariant subset of ASCII that is sufficient to enable these files to |
| interoperate.) |
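
The naming scheme can be sketched in plain C as follows. This is an illustration under stated assumptions rather than ICU's implementation: the helper names are hypothetical, and the caller supplies the version number and the character set family.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 'l', 'b', or 'e', or '\0' for little endian EBCDIC. */
static char data_format_letter(int is_big_endian, int is_ebcdic) {
    if (is_ebcdic) {
        return is_big_endian ? 'e' : '\0';
    }
    return is_big_endian ? 'b' : 'l';
}

/* Detects the byte order of the machine running this code. */
static int host_is_big_endian(void) {
    const unsigned short probe = 1;
    /* On a big endian machine the first byte of the value 1 is zero. */
    return *(const unsigned char *)&probe == 0;
}

/*
 * Writes a name such as "icudt19l.dat" into `out`.
 * Returns 0 on success, -1 if the letter is invalid or `out` is too small.
 */
static int default_data_file_name(int version, char letter,
                                  char *out, size_t out_size) {
    int n;
    if (letter == '\0') {
        return -1;
    }
    n = snprintf(out, out_size, "icudt%d%c.dat", version, letter);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Calling `data_format_letter(host_is_big_endian(), 0)` gives the letter for an ASCII-based host, which can then be passed to `default_data_file_name()` together with the version number.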
| |
| The packaging of the default ICU data as a .dat file rather than as a shared |
| library is requested by using an option in the configure script at build time. |
| Nothing is required at run time; ICU finds and uses whatever form of the data is |
| available. |
| |
| > :point_right: **Note**: When the ICU data is built in the form of shared |
| libraries, the library names have platform-specific prefixes and suffixes. On |
| Unix-style platforms, all the libraries have the "lib" prefix and one of the |
| usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and |
| suffixes, the library names are the same as the above .dat files. |
| |
| ## Customizing ICU's Data Library |
| |
| ICU includes a standard library of data that is about 16 MB in size. Most of |
| this consists of conversion tables and locale information. The data itself is |
| normally placed into a single shared library. |
| |
| Update: as of ICU 64, the standard data library is over 20 MB in size. We have |
| introduced a new tool, the [ICU Data Build Tool](icu_data/buildtool.md), |
| to replace the makefiles explained below and give you more control over what |
| goes into your ICU locale data file. |
| |
| ### Adding Converters to ICU |
| |
| The first step is to obtain or create a .ucm (source) mapping data file for the |
| desired converter. A large archive of converter data is maintained by the ICU |
| team at <https://github.com/unicode-org/icu-data/tree/master/charset/data/ucm> |
| |
| We will use `solaris-eucJP-2.7.ucm`, available from the repository mentioned |
| above, as an example. |
| |
| #### Build the Converter |
| |
| Converter source files are compiled into binary converter files (.cnv files) by |
| using the ICU tool `makeconv`. For the example, you can use this command: |
| |
| ``` |
| makeconv -v solaris-eucJP-2.7.ucm |
| ``` |
| |
| Some of the .ucm files from the repository will need additional header |
| information before they can be built. Use the error messages from the makeconv |
| tool, .ucm files for similar converters, and the ICU user guide documentation of |
| .ucm files as a guide when making changes. For the `solaris-eucJP-2.7.ucm` |
| example, we will borrow the missing header fields from |
| `source/data/mappings/ibm-33722_P12A-2000.ucm`, which is the standard ICU eucJP |
| converter data. |
| |
| The ucm file format is described in the |
| ["Conversion Data" chapter](conversion/data.md) of this user guide. |
| |
| After adjustment, the header of the `solaris-eucJP-2.7.ucm` file contains these |
| items: |
| |
| ``` |
| <code_set_name> "solaris-eucJP-2.7" |
| <subchar> \x3F |
| <uconv_class> "MBCS" |
| |
| <mb_cur_max> 3 |
| <mb_cur_min> 1 |
| |
| <icu:state> 0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1 |
| <icu:state> a1-fe |
| <icu:state> a1-e4 |
| <icu:state> a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4 |
| <icu:state> a1-fe |
| ``` |
| |
| The binary converter file produced by the `makeconv` tool is |
| `solaris-eucJP-2.7.cnv`. |
| |
| #### Installation |
| |
| Copy the new .cnv file to the desired location for use. Set the environment |
| variable `ICU_DATA` to the directory containing the data, or, alternatively, |
| from within an application, tell ICU the location of the new data with the |
| function `u_setDataDirectory()` before using the new converter. |
| |
| If ICU is already obtaining data from files rather than a shared library, |
| install the new file in the same location as the existing ICU data file(s), and |
| don't change/set the environment variable or data directory. |
| |
| If you do not want to add a converter to ICU's base data, you can also generate |
| a conversion table with `makeconv`, use `pkgdata` to generate your own package, |
| and use `ucnv_openPackage()` to open a converter with that conversion table |
| from the generated package. |
| |
| #### Building the new converter into ICU |
| |
| The need to install a separate file and inform ICU of the data directory can be |
| avoided by building the new converter into ICU's standard data library. Here is |
| the procedure for doing so: |
| |
| 1. Move the .ucm file(s) for the converter(s) to be added ( |
| `solaris-eucJP-2.7.ucm` for our example) into the directory |
| `source/data/mappings/` |
| |
| 2. Create, or edit, if it already exists, the file |
| `source/data/mappings/ucmlocal.mk`. Add this line: |
| |
| ``` |
| UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm |
| ``` |
| |
| Any number of converters can be listed. Extend the list to new lines with a |
| backslash at the end of the line. The `ucmlocal.mk` file is described in |
| more detail in `source/data/mappings/ucmfiles.mk`. (Even though they use very |
| different build systems, `ucmlocal.mk` is used for both the Windows and UNIX |
| builds.) |
| |
| 3. Add the converter name and aliases to `source/data/mappings/convrtrs.txt`. |
| This will allow your converter to be shown in the list of available |
| converters when you call the `ucnv_getAvailableName()` function. The file |
| syntax is described within the file. |
| |
| 4. Rebuild the ICU data. |
| For Windows, from MSVC choose the makedata project from the GUI, then build |
| the project. |
| For UNIX, `cd icu/source/data; gmake` |
| |
| When opening an ICU converter (`ucnv_open()`), the converter name cannot be |
| qualified with a path that indicates the directory or common data file |
| containing the corresponding converter data. The required data must be present |
| either in the main ICU data library or as a separate .cnv file located in the |
| ICU data directory. This is different from opening resources or other types of |
| ICU data, which do allow a path. |
| |
| ### Adding Locale Data to ICU's Data |
| |
| If you have data for a locale that is not included in ICU's standard build, then |
| you can add it to the build in a very similar way as with conversion tables |
| above. The ICU project provides a large number of additional locales in its |
| [locale |
| repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/) |
| on the web. Most of this locale data is derived from the CLDR ([Common Locale |
| Data Repository](http://www.unicode.org/cldr/)) project. |
| |
| Dropping the txt file into the correct place in the source tree is sufficient to |
| add it to your ICU build. You will need to re-configure in order to pick it up. |
| |
| ## Customizing ICU's Data Library for ICU 63 or earlier |
| The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space. |
| |
| > :point_right: **Note**: ICU for C by default comes with pre-built data. |
| The source data files are included as an "icu\*data.zip" file starting in ICU4C 49. |
| Previously, they were not included unless ICU was downloaded from the |
| [source repository](https://github.com/unicode-org/icu). Alternatively, the |
| [Data Customizer](http://apps.icu-project.org/datacustom/) may be used to |
| customize the pre-built data. |
| |
| ICU can load data from individual data files as well as from its default |
| library, so building a customized library when adding additional data is not |
| strictly necessary. Adding to ICU's library can simplify application |
| installation by eliminating the need to include separate files with an |
| application distribution, and the need to tell ICU where they are installed. |
| |
| Reducing the size of ICU's data by eliminating unneeded resources can make |
| sense on small systems with limited or no disk, but for desktop or server |
| systems there is no real advantage to trimming. ICU's data is memory mapped |
| into an application's address space, and only those portions of the data |
| actually being used are ever paged in, so there are no significant RAM savings. |
| As for disk space, with the large size of today's hard drives, saving a few MB |
| is not worth the bother. |
| |
| By default, ICU builds with a large set of converters and with all available |
| locales. This means that any extra items added must be provided by the |
| application developer. There is no extra ICU-supplied data that could be |
| specified. |
| |
| ### Details |
| |
| The converters and resources that ICU builds are in the following configuration |
| files. They are only available when building from ICU's source code repository. |
| Normally, the standard ICU distribution does not include these files. |
| |
| File | Description |
| ----------------------------------|-------------- |
| source/data/locales/resfiles.mk | The standard set of locale data resource bundles |
| source/data/locales/reslocal.mk | User-provided file with additional resource bundles |
| source/data/coll/colfiles.mk | The standard set of collation data resource bundles |
| source/data/coll/collocal.mk | User-provided file with additional collation resource bundles |
| source/data/brkitr/brkfiles.mk | The standard set of break iterator data resource bundles |
| source/data/brkitr/brklocal.mk | User-provided file with additional break iterator resource bundles |
| source/data/translit/trnsfiles.mk | The standard set of transliterator resource files |
| source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files |
| source/data/mappings/ucmcore.mk | Core set of conversion tables for MIME/Unix/Windows |
| source/data/mappings/ucmfiles.mk | Additional, large set of conversion tables for a wide range of uses |
| source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables |
| source/data/mappings/ucmlocal.mk | User-provided file with additional conversion tables |
| source/data/misc/miscfiles.mk | Miscellaneous data, like timezone information |
| |
| These files function identically for both Windows and UNIX builds of ICU. ICU |
| will automatically update the list of installed locales returned by |
| `uloc_getAvailable()` whenever `resfiles.mk` or `reslocal.mk` are updated and |
| the ICU data library is rebuilt. These files are only needed while building ICU. |
| If any of these files are removed or renamed, the size of the ICU data library |
| will be reduced. |
| |
| The optional files `reslocal.mk` and `ucmlocal.mk` are not included as part of |
| a standard ICU distribution. Thus these customization files do not need to be |
| merged or updated when updating versions of ICU. |
| |
| Both `reslocal.mk` and `ucmlocal.mk` are makefile includes, so the usual rules
| for makefiles apply. A line is continued onto the next one by ending it with a
| backslash. Lines beginning with `#` are comments. See `ucmfiles.mk` and
| `resfiles.mk` for additional information.
| |
| ### Reducing the Size of ICU's Data: Conversion Tables |
| |
| The size of the ICU data file in the standard build configuration is about 8 MB. |
| The majority of this is used for conversion tables. ICU comes with so many |
| conversion tables because many ICU users need to support many encodings from |
| many platforms. There are conversion tables for EBCDIC and DOS codepages, for |
| ISO 2022 variants, and for small variations of popular encodings. |
| |
| > :point_right: **Important**: ICU provides full internationalization |
| functionality without **any** conversion table data. The common library |
| contains code to handle several important encodings algorithmically: US-ASCII, |
| ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., |
| US-ASCII, ISO-8859-1, and all Unicode charsets; see |
| source/data/mappings/convrtrs.txt for the current list). |
| |
| Therefore, the easiest way to reduce the size of ICU's data substantially
| (without limiting I18N support) is to reduce the number of conversion tables
| that are built into the data file.
| |
| The conversion tables are listed for the build process in several makefiles,
| `source/data/mappings/ucm*.mk`, roughly grouped by how commonly they are used.
| If you remove or rename any of these files, then the ICU build will exclude the
| conversion tables that are listed in that file. Beginning with ICU 2.0, all of
| these makefiles, including the main one, are optional. If you remove all of
| them, then ICU will include only very few conversion tables for "fallback"
| encodings (see note below).
| |
| If you remove or rename all `ucm*.mk` files, then ICU's data is reduced to
| about 3.6 MB. If you remove all these files except for `ucmcore.mk`, then ICU's |
| data is reduced to about 4.7 MB, while keeping support for a core set of common |
| MIME/Unix/Windows encodings. |
| |
| > :point_right: **Note**: If you remove the conversion table for an encoding |
| that could be a default encoding on one of your platforms, then ICU will not be |
| able to instantiate a default converter. In this case, ICU 2.0 and up will |
| automatically fall back to a "lowest common denominator" and load a converter |
| for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be |
| good enough for converting strings that contain only "ASCII" characters (see the |
| comment about "invariant characters" in `utypes.h`). |
| *When ICU is built with a reduced set of conversion tables, some tests that
| check converter behavior based on known features of some encodings will fail.
| Also, building the testdata will fail if you remove conversion tables that it
| needs (to test non-ASCII/Unicode resource bundle source files, for example).
| You can ignore these failures. Build with the standard set of conversion
| tables if you want to run the tests.*
| |
| ### Reducing the Size of ICU's Data: Locale Data |
| |
| If you need to reduce the size of ICU's data even further, then you need to |
| remove other files or parts of files from the build as well. |
| |
| There are a number of different subdirectories of 'data' containing locale data |
| split out by section. Each subdirectory has its own **.mk** file listing the |
| locales which will be built. Subdirectories include **lang** for language names |
| and **curr** for currency names. |
| |
| You can remove data for entire locales by removing their files from
| `source/data/locales/resfiles.mk` or the appropriate other .mk file. ICU will
| then use the data of the parent locale instead. If you remove all resource
| bundles for a given language and its country/region/variant sublocales, that
| parent is the root locale, `root.txt`. **Do not remove root.txt!** Also, do not
| remove a parent locale if child locales exist. For example, do not remove "en"
| while retaining "en_US".
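| 
| The parent/child rule can be checked mechanically before rebuilding. The
| following is a minimal sketch, not part of ICU: `find_orphan_locale` and
| `is_kept` are hypothetical helpers, and the sketch assumes that a locale's
| parent is obtained simply by dropping the last `_subtag` from its ID. ICU's
| real parent relationships can differ for some locales, so treat the result as
| a hint only.
| 
| ```c
| #include <assert.h>
| #include <stddef.h>
| #include <string.h>
| 
| /* Returns 1 if the first name_len characters of `name` form an ID in `kept`. */
| static int is_kept(const char *name, size_t name_len,
|                    const char *const *kept, size_t count) {
|     for (size_t i = 0; i < count; ++i) {
|         if (strlen(kept[i]) == name_len &&
|             strncmp(kept[i], name, name_len) == 0) {
|             return 1;
|         }
|     }
|     return 0;
| }
| 
| /*
|  * Returns the index of the first kept locale whose direct parent (the ID with
|  * its last "_subtag" removed) is missing from the list, or -1 if every parent
|  * is present. IDs without an underscore have root as their parent; root is
|  * assumed to be always kept.
|  */
| int find_orphan_locale(const char *const *kept, size_t count) {
|     assert(kept != NULL || count == 0);
|     for (size_t i = 0; i < count; ++i) {
|         const char *last = strrchr(kept[i], '_');
|         if (last != NULL &&
|             !is_kept(kept[i], (size_t)(last - kept[i]), kept, count)) {
|             return (int)i;
|         }
|     }
|     return -1;
| }
| ```
| 
| For the list `en`, `en_US`, `de_DE`, the function reports index 2, because
| `de_DE` is kept while its parent `de` is not.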
| |
| ### Reducing the Size of ICU's Data: Collation Data |
| |
| Collation data (for sorting, searching and alphabetic indexes) is also large, |
| especially the collation data for East Asian languages because they define |
| multiple orderings of tens of thousands of Han characters. You can remove the |
| collation data for those languages by removing references to those locales from
| the `source/data/coll/colfiles.mk` file. When you do that, the collation for
| those languages will fall back to the root collator; that is, you lose
| language-specific behavior.
| |
| A much less radical approach is to keep the collation data tables but remove the |
| tailoring rule strings from which they were built. Those rule strings are |
| rarely used at runtime. For documentation about their use and how to remove |
| them see the section "Building on Existing Locales" in the |
| [Collation Customization chapter](collation/customization/index.md). |
| |
| ### Adding Locale Data to ICU's Data |
| To add a locale, write a resource bundle file for it with a structure like the
| existing locale resource bundles (e.g. `source/data/locales/ja.txt`,
| `ru_RU.txt`, `kok_IN.txt`) and register it in a file
| `source/data/locales/reslocal.mk`. In this file, define the list of additional
| resource bundles as
| |
| ``` |
| GENRB_SOURCE_LOCAL=myLocale.txt other.txt ... |
| ``` |
| |
| Starting in ICU 2.2, these added locales are automatically listed by |
| `uloc_getAvailable()`. |
| |
| ## ICU Data File Formats |
| |
| ICU uses several kinds of data files with specific source (plain text) and |
| binary data formats. The following lists provide links to descriptions of those
| formats. |
| |
| Each ICU data object begins with a header before the actual, specific data. The |
| header consists of a 16-bit header length value, the two "magic" bytes DA 27 and |
| a [UDataInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/structUDataInfo.html#_details) |
| structure which specifies the data object's endianness, charset family, format, |
| data version, etc. |
| |
| (This is not the case for the trie structures, which are not stand-alone, |
| loadable data objects.) |
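| 
| The header layout described above can be read with a few lines of C. This is a
| minimal sketch rather than ICU API: `IcuHeaderSummary` and `icu_header_parse`
| are hypothetical names, and the byte offsets assume the `UDataInfo` field order
| declared in `unicode/udata.h` (size, reserved word, `isBigEndian`,
| `charsetFamily`, `sizeofUChar`, reserved byte, then three 4-byte arrays). Check
| the offsets against the header of your ICU version before relying on them.
| 
| ```c
| #include <assert.h>
| #include <stddef.h>
| #include <stdint.h>
| #include <string.h>
| 
| /* Hypothetical summary of the fields at the start of an ICU data object. */
| typedef struct {
|     uint16_t header_size;     /* total header length in bytes */
|     int      is_big_endian;   /* UDataInfo.isBigEndian */
|     uint8_t  charset_family;  /* UDataInfo.charsetFamily */
|     uint8_t  sizeof_uchar;    /* UDataInfo.sizeofUChar */
|     char     data_format[5];  /* UDataInfo.dataFormat plus a terminating NUL */
|     uint8_t  format_version[4];
|     uint8_t  data_version[4];
| } IcuHeaderSummary;
| 
| /* Reads a 16-bit value in the byte order declared by the data itself. */
| static uint16_t read_u16(const uint8_t *p, int big_endian) {
|     return big_endian ? (uint16_t)((p[0] << 8) | p[1])
|                       : (uint16_t)(p[0] | (p[1] << 8));
| }
| 
| /* Returns 0 on success, -1 if the bytes do not look like an ICU data header. */
| int icu_header_parse(const uint8_t *bytes, size_t length,
|                      IcuHeaderSummary *out) {
|     assert(bytes != NULL && out != NULL);
|     /* 4 bytes (header length + magic) followed by a 20-byte UDataInfo. */
|     if (length < 24 || bytes[2] != 0xDA || bytes[3] != 0x27) {
|         return -1;
|     }
|     const uint8_t *info = bytes + 4;
|     /* isBigEndian is a single byte, so read it before any 16-bit field. */
|     out->is_big_endian = info[4] != 0;
|     out->header_size = read_u16(bytes, out->is_big_endian);
|     if (out->header_size < 24 || read_u16(info, out->is_big_endian) < 20) {
|         return -1;
|     }
|     out->charset_family = info[5];
|     out->sizeof_uchar = info[6];
|     memcpy(out->data_format, info + 8, 4);
|     out->data_format[4] = '\0';
|     memcpy(out->format_version, info + 12, 4);
|     memcpy(out->data_version, info + 16, 4);
|     return 0;
| }
| ```
| 
| The endianness byte is read first because the 16-bit header length is stored in
| the byte order of the platform the data was built for.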
| |
| ### Public Data Files |
| |
| #### ICU.dat package files |
| * Source format: (list of files provided as input to the icupkg tool, or |
| on the gencmn tool command line) |
| * Binary format: .dat: [source/tools/toolutil/pkg_gencmn.cpp](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp) |
| * Generator tool: [icupkg](../../icu4c/source/tools/icupkg) or |
| [gencmn](../../icu4c/source/tools/gencmn) |
| |
| #### Resource bundles |
| * Source format: .txt: [icuhtml/design/bnf_rb.txt](https://github.com/unicode-org/icu-docs/blob/master/design/bnf_rb.txt) |
| * Binary format: .res: [source/common/uresdata.h](../../icu4c/source/common/uresdata.h) |
| * Generator tool: [genrb](../../icu4c/source/tools/genrb) |
| |
| #### Unicode conversion mapping tables |
| * Source format: .ucm: [Conversion Data chapter](conversion/data.md) |
| * Binary format: .cnv: [source/common/ucnvmbcs.h](../../icu4c/source/common/ucnvmbcs.h) |
| * Generator tool: [makeconv](../../icu4c/source/tools/makeconv) |
| |
| #### Conversion (charset) aliases |
| * Source format: [source/data/mappings/convrtrs.txt](../../icu4c/source/data/mappings/convrtrs.txt): |
| contains format description. The command "uconv -l --canon" |
| will also generate the alias table from the currently used |
| copy of ICU. |
| * Binary format: cnvalias.icu: [source/common/ucnv_io.cpp](../../icu4c/source/common/ucnv_io.cpp) |
| * Generator tool: [gencnval](../../icu4c/source/tools/gencnval) |
| |
| #### Unicode Character Data (Properties; for Java only: hardcoded in C common library) |
| * Source format: [source/data/unidata/ppucd.txt](../../icu4c/source/data/unidata/ppucd.txt): |
| [Preparsed UCD](http://site.icu-project.org/design/props/ppucd) |
| * Binary format: uprops.icu: [tools/unicode/c/genprops/corepropsbuilder.cpp](../../tools/unicode/c/genprops/corepropsbuilder.cpp) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Unicode Character Data (Case mappings; for Java only: hardcoded in C common library) |
| * Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata): |
| [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: ucase.icu: [tools/unicode/c/genprops/casepropsbuilder.cpp](../../tools/unicode/c/genprops/casepropsbuilder.cpp) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library) |
| * Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata): |
| [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: ubidi.icu: [tools/unicode/c/genprops/bidipropsbuilder.cpp](../../tools/unicode/c/genprops/bidipropsbuilder.cpp) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Unicode Character Data (Normalization since ICU 4.4) & custom normalization data |
| * Source format: [source/data/unidata/norm2/*.txt](../../icu4c/source/data/unidata/norm2):
| Files derived from the [Unicode Character Database](http://www.unicode.org/onlinedat/online.html), |
| or custom data. |
| * Binary format: .nrm: [source/common/normalizer2impl.h](../../icu4c/source/common/normalizer2impl.h) |
| * Generator tool: [gennorm2](../../icu4c/source/tools/gennorm2) |
| |
| #### Unicode Character Data (Character names) |
| * Source format: [source/data/unidata/UnicodeData.txt](../../icu4c/source/data/unidata/UnicodeData.txt): |
| [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: unames.icu: [tools/unicode/c/genprops/namespropsbuilder.cpp](../../tools/unicode/c/genprops/namespropsbuilder.cpp) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8) |
| * Source format: [UCD Property*Aliases.txt](http://www.unicode.org/Public/UNIDATA/): |
| [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: pnames.icu: [source/common/propname.h](../../icu4c/source/common/propname.h) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Unicode Character Data (Text layout properties since ICU 64) |
| * Source format: [source/data/unidata/ppucd.txt](../../icu4c/source/data/unidata/ppucd.txt): |
| [Preparsed UCD](http://site.icu-project.org/design/props/ppucd) |
| * Binary format: ulayout.icu: [tools/unicode/c/genprops/layoutpropsbuilder.cpp](../../tools/unicode/c/genprops/layoutpropsbuilder.cpp) |
| * Generator tool: [genprops](../../tools/unicode/c/genprops) |
| |
| #### Collation data (root collation & tailorings; ICU 53 & later) |
| * Source format: Original data from allkeys_CLDR.txt in [CLDR Root Collation Data Files](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files) |
| processed into [source/data/unidata/FractionalUCA.txt](../../icu4c/source/data/unidata/FractionalUCA.txt) by |
| [tool at unicode.org maintained by Mark Davis](https://sites.google.com/site/unicodetools/#TOC-UCA) |
| (call the Main class with option writeFractionalUCA); |
| source tailorings (text rules) in [source/data/coll/*.txt](../../icu4c/source/data/coll) resource bundles: |
| [Collation Customization chapter](collation/customization/index.md). |
| * Binary format: ucadata.icu & binary tailorings in resource bundles: |
| [source/i18n/collationdatareader.h](../../icu4c/source/i18n/collationdatareader.h) |
| * Generator tool: [genuca](../../tools/unicode/c/genuca), [genrb](../../icu4c/source/tools/genrb) |
| |
| #### Rule-based break iterator data |
| * Source format: .txt: [Boundary Analysis chapter](boundaryanalysis/index.md) |
| * Binary format: .brk: [source/common/rbbidata.h](../../icu4c/source/common/rbbidata.h) |
| * Generator tool: [genbrk](../../icu4c/source/tools/genbrk) |
| |
| #### Dictionary-based break iterator data (ICU 50 & later) |
| * Source format: .txt: [gendict.cpp comments](../../icu4c/source/tools/gendict/gendict.cpp)
| * Binary format: .dict: see [source/common/dictionarydata.h](../../icu4c/source/common/dictionarydata.h)
| * Generator tool: [gendict](../../icu4c/source/tools/gendict) |
| |
| #### Rule-based transform (transliterator) data |
| * Source format: .txt (in resource bundles): [Transform Rule Tutorial chapter](transforms/general/rules.md) |
| * Binary format: Uses genrb to make binary format |
| * Generator tool: Does not apply |
| |
| #### Time zone data (ICU 4.4 & later) |
| * Source format: [source/data/misc/zoneinfo64.txt](../../icu4c/source/data/misc/zoneinfo64.txt): |
| ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz |
| * Binary format: zoneinfo64.res (generated by genrb and [tzcode tools](../../icu4c/source/tools/tzcode/readme.txt)). |
| * Generator tool: Does not apply |
| |
| #### StringPrep profile data |
| * Source format: [source/data/sprep/rfc3491.txt](../../icu4c/source/data/sprep/rfc3491.txt)
| * Binary format: .spp: [source/tools/gensprep/store.c](../../icu4c/source/tools/gensprep/store.c) |
| * Generator tool: [gensprep](../../icu4c/source/tools/gensprep) |
| |
| #### Confusables data |
| * Source format: [source/data/unidata/confusables.txt](../../icu4c/source/data/unidata/confusables.txt), |
| [source/data/unidata/confusablesWholeScript.txt](../../icu4c/source/data/unidata/confusablesWholeScript.txt) |
| * Binary format: .cfu: confusables.cfu: [source/i18n/uspoof_impl.h](../../icu4c/source/i18n/uspoof_impl.h)
| * Generator tool: [gencfu](../../icu4c/source/tools/gencfu) |
| |
| ### Public Data Files (old versions) |
| |
| #### Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library) |
| * Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata):
| [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: unorm.icu: [source/common/unormimp.h](../../icu4c/source/common/unormimp.h) |
| * Generator tool: gennorm |
| |
| #### Unicode Character Data (Property [value] aliases before ICU 4.8) |
| * Source format: source/data/unidata/Property*Aliases.txt: [Unicode Character Database](http://www.unicode.org/onlinedat/online.html) |
| * Binary format: pnames.icu: source/common/propname.h (ICU 4.6) |
| * Generator tool: genpname |
| |
| #### Collation data (UCA, code points to weights; ICU 52 & earlier) |
| * Source format: Same as in ICU 53 |
| * Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52) |
| * Generator tool: [genuca](../../tools/unicode/c/genuca), [genrb](../../icu4c/source/tools/genrb) |
| |
| #### Collation data (Inverse UCA, weights->code points; ICU 52 & earlier) |
| * Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu |
| * Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52) |
| * Generator tool: [genuca](../../tools/unicode/c/genuca) |
| |
| #### Dictionary-based break iterator data (ICU 49 & earlier) |
| * Source format: .txt: genctd.cpp comments |
| * Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
| * Generator tool: genctd |
| |
| #### Time zone data (Before ICU 4.4) |
| * Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz
| * Binary format: zoneinfo64.res (generated by genrb and [tzcode tools](../../icu4c/source/tools/tzcode/readme.txt)). |
| * Generator tool: Does not apply |
| |
| ### Non-File API Binary Data |
| |
| #### Converter selector data |
| * Source format: none |
| * Binary format: [source/common/ucnvsel.cpp](../../icu4c/source/common/ucnvsel.cpp) |
| * Generator tool: [ucnvsel_open()](../../icu4c/source/common/ucnvsel.cpp) |
| |
| ### Test-Only Data Files |
| |
| #### test.icu (for udata API testing) |
| * Source format: none (fixed output from gentest when not using -r or -j options) |
| * Binary format: test.icu: see `createData()` |
| in [source/tools/gentest/gentest.c](../../icu4c/source/tools/gentest/gentest.c) |
| * Generator tool: [gentest](../../icu4c/source/tools/gentest/gentest.c) |
| |
| ### Other Data Structures |
| |
| #### UCPTrie (C)/CodePointTrie (Java) (maps code points to integers) |
| * Source format: (public builder API) |
| * Binary format: [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie), |
| [icu4c/source/common/ucptrie_impl.h](../../icu4c/source/common/ucptrie_impl.h) |
| * Generator tool: (builder class) |
| |
| #### UTrie2 (C)/Trie2 (Java) (maps code points to integers) |
| * Source format: (internal builder API) |
| * Binary format: [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie), |
| [icu4c/source/common/utrie2_impl.h](../../icu4c/source/common/utrie2_impl.h) |
| * Generator tool: (builder class) |
| |
| #### BytesTrie (maps byte sequences to 32-bit integers) |
| * Source format: (public builder API) |
| * Binary format: [BytesTrie design doc](http://site.icu-project.org/design/struct/tries/bytestrie), |
| [icu4c/source/common/unicode/bytestrie.h](../../icu4c/source/common/unicode/bytestrie.h) |
| * Generator tool: (builder class) |
| |
| #### UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers) |
| * Source format: (public builder API) |
| * Binary format: [UCharsTrie design doc](http://site.icu-project.org/design/struct/tries/ucharstrie), |
| [icu4c/source/common/unicode/ucharstrie.h](../../icu4c/source/common/unicode/ucharstrie.h) |
| * Generator tool: (builder class) |
| |
| ## ICU4J Resource Information |
| |
| Starting with release 2.1, ICU4J includes its own resource information which is
| completely independent of the JRE resource information. (Note: in ICU4J 2.8 to
| 3.4, time zone information depends on the underlying JRE.) The new ICU4J
| information is equivalent to the information in ICU4C and many resources are, in
| fact, the same binary files that ICU4C uses.
| |
| By default the ICU4J distribution includes all of the standard resource |
| information. It is located under the directory `com/ibm/icu/impl/data`. |
| Depending on the service, the data is in different locations and in different |
| formats. Note: This will continue to change from release to release, so clients |
| should not depend on the exact organization of the data in ICU4J. |
| |
| 1. The primary **locale data** is under the directory `icudt38b`, as a set of
| ".res" files whose names are the locale identifiers. Locale naming is |
| documented in the `com.ibm.icu.util.ULocale` class, and the use of these |
| names in searching for resources is documented in |
| `com.ibm.icu.util.UResourceBundle`. |
| |
| 2. The **collation data** is under the directory `icudt38b/coll`, as a set of |
| ".res" files. |
| |
| 3. The **rule-based transliterator data** is under the directory |
| `icudt38b/translit` as a set of ".res" files. (**Note:** the Han |
| transliterator test data is no longer included in the core icu4j.jar file by |
| default.) |
| |
| 4. The **rule-based number format data** is under the directory `icudt38b/rbnf` |
| as a set of ".res" files. |
| |
| 5. The **break iterator data** is directly under the data directory, as a set |
| of ".brk" files, named according to the type of break and the locale where |
| there are locale-specific versions. |
| |
| 6. The **holiday data** is under the data directory, as a set of ".class" |
| files, named "HolidayBundle_" followed by the locale ID. |
| |
| 7. The **character property data** as well as assorted **normalization data** |
| and default **Unicode Collation Algorithm (UCA) data** are found under the
| data directory as a set of ".icu" files. |
| |
| 8. The **character set converter data** is under the directory `icudt38b/`, as |
| a set of ".cnv" files. These files are currently included only in |
| icu-charset.jar. |
| |
| 9. The **time zone data** is named `zoneinfo.res` under the directory |
| `icudt38b`. |
| |
| Some of the data files alias or otherwise reference data from other data files. |
| One reason for this is that some locale names have changed. For example,
| he_IL used to be iw_IL. In order to support both names but not duplicate the |
| data, one of the resource files refers to the other file's data. In other cases, |
| a file may alias a portion of another file's data in order to save space. |
| Currently ICU4J provides no tool for revealing these dependencies. |
| |
| > :point_right: **Note**: Java's Locale class silently converts the language |
| code "he" to "iw" when you construct the Locale (for versions of Java through |
| Java 5). Thus Java cannot be used to locate resources that use the "he" language |
| code. ICU, on the other hand, does not perform this conversion in ULocale, and |
| instead uses aliasing in the locale data to represent the same set of data under |
| different locale ids. |
| |
| Resource files that use locale ids form a hierarchy, with up to four levels: a
| root, language, region (country), and variant. Searches for locale data attempt
| to match as far down the hierarchy as possible. For example, "he_IL" will match
| he_IL, but "he_US" will match he (since there is no US variant for he), and
| "xx_YY" will match root (the default fallback locale), since there is no xx
| language code in the locale hierarchy. Again, see `java.util.ResourceBundle` for
| more information.
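| 
| The lookup described above can be sketched as a loop that strips one subtag at
| a time. The code below is an illustration only, written in C for brevity; it
| is not the ICU4J implementation. `resolve_locale` is a hypothetical function,
| and it assumes that the available bundles are given as a plain list of locale
| ids and that root is always present.
| 
| ```c
| #include <assert.h>
| #include <stddef.h>
| #include <string.h>
| 
| /*
|  * Resolves `requested` against the available bundle names by repeatedly
|  * dropping the last "_subtag". Writes the matching id into `out` and returns
|  * it; falls back to "root" when nothing matches.
|  */
| const char *resolve_locale(const char *requested,
|                            const char *const *available, size_t count,
|                            char *out, size_t out_size) {
|     assert(requested != NULL && out != NULL && out_size >= sizeof "root");
|     size_t len = strlen(requested);
|     if (len >= out_size) {
|         len = 0;  /* id too long for the buffer: treat it as unknown */
|     }
|     memcpy(out, requested, len);
|     out[len] = '\0';
|     while (len > 0) {
|         for (size_t i = 0; i < count; ++i) {
|             if (strcmp(available[i], out) == 0) {
|                 return out;
|             }
|         }
|         /* No bundle with this id: cut off the last subtag and retry. */
|         char *last = strrchr(out, '_');
|         len = (last != NULL) ? (size_t)(last - out) : 0;
|         out[len] = '\0';
|     }
|     strcpy(out, "root");
|     return out;
| }
| ```
| 
| With the bundles `he` and `he_IL` available, a request for `he_US` resolves to
| `he`, and a request for `xx_YY` resolves to `root`.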
| |
| Currently ICU4J provides no tool for revealing these dependencies between data |
| files, so trimming the data directly in the ICU4J project is a hit-or-miss |
| affair. The key point when you remove data is to make sure to remove all |
| dependencies on that data as well. For example, if you remove he.res, you need |
| to remove he_IL.res, since it is lower in the hierarchy, and you must remove |
| iw.res, since it references he.res, and iw_IL.res, since it depends on it (and |
| also references he_IL.res). |
| |
| Unfortunately, the jar tool in the JDK provides no way to remove items from a |
| jar file. Thus you have to extract the resources, remove the ones you don't |
| want, and then create a new jar file with the remaining resources. See the jar |
| tool information for how to do this. Before re-jarring the files, be sure to
| thoroughly test your application with the remaining resources, making sure each |
| required resource is present. |
| |
| ### Using additional resource files with ICU4J
| |
| > :point_right: **Note**: Resource file formats can change across releases of ICU4J! |
| > |
| > *The format of ICU4J resources is not part of the API. Clients who develop their |
| > own resources for use with ICU4J should be prepared to regenerate them when they |
| > move to new releases of ICU4J.* |
| |
| We are still developing ICU4J's resource mechanism. Currently it is not possible
| to mix ICU's new binary .res resources with traditional Java-style .class or
| .txt resources. We might allow for this in a future release, but since the
| resource data and format are not formally supported, you run the risk of
| incompatibilities with future releases of ICU4J.
| |
| Resource data in ICU4J is checked in to the repository as a jar file containing |
| the resource binaries, icudata.jar. This means that inspecting the contents of |
| these resources is difficult. They currently are compiled from ICU4C .txt file |
| data. You can view the contents of the ICU4C text resource files to understand |
| the contents of the ICU4J resources. |
| |
| The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build |
| directory when the 'core' target is built. Building the 'resources' target will |
| force the resources to once again be extracted. Extraction will overwrite any |
| corresponding resource files already in that directory. |
| |
| ### Building ICU4J Resources from ICU4C |
| |
| #### Requirements |
| |
| 1. [ICU4C](http://icu-project.org/download/) |
| |
| 2. Compilers and tools required for [building ICU4C](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild). |
| |
| 3. J2SE SDK version 5 or above |
| |
| #### Procedure |
| |
| 1. Download and build ICU4C on a Windows or Linux machine. For instructions on downloading and building ICU4C, see the
| [ICU4C readme](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).
| |
| 2. Follow the remaining instructions in |
| [*$icu4c_root*/source/data/icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt). |
| *$icu4c_root* is the root directory of ICU4C source package. |