
 ---
 layout: default
 title: ICU Data
 nav_order: 13
 has_children: true
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # ICU Data
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Overview

 ICU makes use of a wide variety of data tables to provide many of its services.
 Examples include converter mapping tables, collation rules, transliteration
 rules, break iterator rules and dictionaries, and other locale data. Additional
 data can be provided by users, either as customizations of ICU's data or as new
 data altogether.

 This section describes how ICU data is stored and located at run time. It also
 describes how ICU data can be customized to suit the needs of a particular
 application.

 For simple use of ICU's predefined data, this section on data management can
 safely be skipped. The data is built into a library that is loaded along with
 the rest of ICU. No specific action or setup is required of either the
 application program or the execution environment.

 Update: as of ICU 64, the standard data library is over 20 MB in size. We have
 introduced a new tool, the [ICU Data Build Tool](./icu_data/buildtool.md),
 to give you more control over what goes into your ICU locale data file.

 > :point_right: **Note**: ICU for C by default comes with pre-built data.
 > The source data files are included as an "icu\*data.zip" file starting in ICU4C 49.
 > Previously, they were not included unless ICU was downloaded from the [source repository](http://site.icu-project.org/repository).

 ## ICU and CLDR Data

 Most of ICU's data is sourced from [CLDR](http://cldr.unicode.org), the [Common
 Locale Data Repository](http://cldr.unicode.org) project. Do not file bugs
 against ICU to request data changes in CLDR; instead, see the CLDR project's
 page. Because most ICU data files are autogenerated from CLDR, manually editing
 them is not usually recommended.

 Data which is NOT sourced from CLDR includes:

 *   [Conversion Data](conversion/data.md)
 *   Break Iterator Dictionary Data (Thai, CJK, etc.)
 *   Break Iterator Rule Data (as of this writing, it is manually kept in sync
     with the CLDR datasets)

 For information on building ICU data from CLDR, see the
 [cldr-icu-readme](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/cldr-icu-readme.txt).

 ## ICU Data Directory

 The ICU data directory is the default location for all ICU data. Any requests
 for data items that do not include an explicit directory path will be resolved
 to files located in the ICU data directory.

 The ICU data directory is determined as follows:

 1.  If the application has called the function `u_setDataDirectory()`, use the
     directory specified there, otherwise:

 2.  If the environment variable `ICU_DATA` is set, use that, otherwise:

 3.  If the C preprocessor variable `ICU_DATA_DIR` was set at the time ICU was
     built, use its compiled-in value.

 4.  Otherwise, the ICU data directory is an empty string. This is the default
     behavior for ICU using a shared library for its data and provides the
     highest data loading performance.

 > :point_right: **Note**: `u_setDataDirectory()` is not thread-safe. Call it
 > *before* calling ICU APIs from multiple threads. If you use both
 > `u_setDataDirectory()` and `u_init()`, then use `u_setDataDirectory()` first.
 >
 > *Earlier versions of ICU supported two additional schemes: setting a data
 > directory relative to the location of the ICU shared libraries, and on Windows,
 > taking a location from the registry. These have both been removed to make the
 > behavior more predictable and easier to understand.*

 The ICU data directory does not need to be set in order to reference the
 standard built-in ICU data. Applications that just use standard ICU capabilities
 (converters, locales, collation, etc.) but do not build and reference their own
 data do not need to specify an ICU data directory.
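
 The lookup order described in this section can be modeled as a small pure
 function. This is only an illustrative sketch, not ICU's implementation:
 `resolve_data_dir` and its parameter names are hypothetical, and treating an
 empty `ICU_DATA` value as "not set" is an assumption made for the example.

 ```c
 #include <assert.h>
 #include <stddef.h>

 /*
  * Illustrative model of the documented lookup order (not ICU source code).
  *   api_value:      directory passed to u_setDataDirectory(), or NULL if the
  *                   application never called it
  *   env_value:      value of the ICU_DATA environment variable, or NULL
  *   compiled_value: build-time ICU_DATA_DIR, or NULL if it was not defined
  * Returns the first available value, or "" when nothing is configured.
  */
 const char *resolve_data_dir(const char *api_value,
                              const char *env_value,
                              const char *compiled_value)
 {
     if (api_value != NULL) {
         return api_value;          /* 1. explicit API call wins */
     }
     if (env_value != NULL && env_value[0] != '\0') {
         return env_value;          /* 2. ICU_DATA (empty treated as unset: assumption) */
     }
     if (compiled_value != NULL) {
         return compiled_value;     /* 3. compiled-in ICU_DATA_DIR */
     }
     return "";                     /* 4. empty string, the shared-library default */
 }
 ```

 A caller could pass `getenv("ICU_DATA")` as the second argument. An empty
 result corresponds to case 4, where ICU relies on the data shared library alone.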

 ### Multiple-Item ICU Data Directory Values

 The ICU data directory string can contain multiple directories as well as .dat
 path/filenames. They must be separated by the path separator that is used on the
 platform, for example a semicolon (`;`) on Windows. Data files will be searched in
 all directories and .dat package files in the order of the directory string. For
 details, see the example below.
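
 As an illustration, a value such as `/opt/app/data:/opt/app/extra.dat` (on a
 platform whose path separator is a colon) names one directory followed by one
 .dat package file. The sketch below walks such a string entry by entry. The
 helper names `data_dir_entry` and `is_dat_package` are hypothetical, the sample
 paths are made up, and recognizing a package purely by its `.dat` suffix is a
 simplifying assumption. ICU's `putil.h` defines `U_PATH_SEP_CHAR` for the
 platform separator; the sketch takes the separator as a parameter so that it
 needs no ICU headers.

 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <string.h>

 /*
  * Copies the index-th entry (0-based) of a separator-delimited data directory
  * string into out. Returns 1 on success, 0 if there is no such entry or the
  * buffer is too small.
  */
 int data_dir_entry(const char *dirs, char sep, size_t index,
                    char *out, size_t out_size)
 {
     assert(dirs != NULL && out != NULL && out_size > 0);
     const char *start = dirs;
     for (;;) {
         const char *end = strchr(start, sep);
         size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);
         if (index == 0) {
             if (len >= out_size) {
                 return 0;              /* entry does not fit */
             }
             memcpy(out, start, len);
             out[len] = '\0';
             return 1;
         }
         if (end == NULL) {
             return 0;                  /* ran out of entries */
         }
         start = end + 1;               /* continue after the separator */
         index--;
     }
 }

 /* Returns nonzero if the entry looks like a .dat package file (assumption). */
 int is_dat_package(const char *entry)
 {
     size_t n = strlen(entry);
     return n >= 4 && strcmp(entry + n - 4, ".dat") == 0;
 }
 ```

 Visiting entries with increasing `index` until the function returns 0 yields
 them in the same order in which the text above says they are searched.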

 ## Default ICU Data

 The default ICU data consists of the data needed for the converters, collators,
 locales, etc. that are provided with ICU. Default data must be present in order
 for ICU to function.

 The default data is most commonly built into a shared library that is installed
 with the other ICU libraries. Nothing is required of the application for this
 mechanism to work. ICU provides additional options for loading the default data
 if more flexibility is required.

 Here are the steps followed by ICU to locate its default data. This procedure
 happens only once per process, at the time an ICU data item is first requested.

 1.  If the application has called the function `udata_setCommonData()`, use the
     data that was provided. The application specifies the address in memory of
     an image of an ICU common format data file (either in shared-library format
     or .dat package file format).

 2.  Examine the contents of the default ICU data shared library. If it contains
     data, use that data. If the data library is empty, a stub library, proceed
     to the next step. (A data shared library must always be present in order for
     ICU to successfully link and load. A stub data library is used when the
     actual ICU common data is to be provided from another source).

 3.  Dynamically load (memory map, typically) a common format (.dat) file
     containing the default ICU data. Loading is described in the section
     [How Data Loading Works](icudata#how-data-loading-works). The path to
     the data is of the form  "icudt\<version\>\<flag\>", where \<version\> is
     the two-digit ICU version number, and \<flag\> is a letter indicating the
     internal format of the file (see the
     [Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms)
     section).

 Once the default ICU data has been located, loading of individual data items
 proceeds as described in the section
 [How Data Loading Works](icudata#how-data-loading-works).

 ## Building and Linking against ICU data

 When using ICU's configure or runConfigureICU tool to build, several different
 methods of packaging are available.

 > :point_right: **Note**: in all cases, you **must** link all ICU tools and
 applications against a "data library": either a data library containing the ICU
 data, or against the "stubdata" library located in icu/source/stubdata. For
 example, even if ICU is built in "files" mode, you must still link against the
 "stubdata" library or an undefined symbol error occurs.

 *   `--with-data-packaging=library`
     This mode builds a shared library (DLL or .so). This is the simplest mode to
     use, and is the default.
     To use: link your application against the common and data libraries.
     This is the only directly supported behavior on Windows builds.
 *   `--with-data-packaging=static`
     This option builds ICU data as a single (large) static library. This mode is
     more complex to use. If you encounter errors, you may need to build ICU
     multiple times.
 *   `--with-data-packaging=files`
     With this option, ICU outputs separate individual files (.res, .cnv, etc)
     which will be loaded at runtime. Read the rest of this document, especially
     the sections that discuss the ICU directory path.
 *   `--with-data-packaging=archive`
     With this option, ICU outputs a single "icudt__.dat" file containing ICU
     data. Read the rest of this document, especially the sections that discuss
     the ICU directory path.

 ## Time Zone Data

 Because time zone data requires frequent updates in response to countries
 changing their transition dates for daylight saving time, ICU provides
 additional options for loading time zone data from separate files, thus avoiding
 the need to update a combined ICU data package. Further information is found
 under [Time Zones](datetime/timezone/index.md).

 ## Application Data

 ICU-based applications can ship and use their own data for localized strings,
 custom conversion tables, etc. Each data item file must have a package name as a
 prefix, and this package name must match the basename of a .dat package file, if
 one is used. The package name must be used in ICU APIs, for example in
 `udata_setAppData()` (instead of `udata_setCommonData()` which is only used for
 ICU's own data) and in the pathname argument of `ures_open()`.

 The only real difference from ICU's own data is that application data cannot be
 loaded simply by specifying a NULL value for the path arguments of ICU APIs, and
 application data will not be used by APIs that do not have path/package name
 arguments at all.

 The most important APIs that allow application data to be used are for Resource
 Bundles, which are most often used for localized strings and other data. There
 are also functions like `ucnv_openPackage()` that allow application data to be
 specified, and the `udata.h` API can be used to load any data with minimum
 requirements on the binary format, and without ICU interpreting the contents of
 the data.

 The `pkgdata` tool, which is used to package the data into various formats (e.g.
 shared library), has an option (`--without-assembly` or `-w`) to not use
 assembly code when building and packaging the application specific data into a
 shared library. Building the data with assembly code, which is enabled by
 default, is faster and more efficient; however, there are some platform
 specific issues that may arise. The `--without-assembly` option may be
 necessary on certain platforms (e.g. Linux) which have trouble properly loading
 application data when it was built with assembly code and is packaged as a
 shared library.

 ## Alignment

 ICU data is designed to be 16-aligned, with natural alignment of values inside
 the data structure, so that the data is usable as is when memory-mapped.
 ("16-aligned" means that the start address is a multiple of 16 bytes.)

 Memory-mapping (as well as memory allocation) provides at least 16-alignment on
 modern platforms. Some CPUs require n-alignment of types of size n bytes (and
 crash on unaligned reads), other CPUs usually operate faster on data that is
 aligned properly.

 Some of the ICU code explicitly checks for proper alignment.

 The `icupkg` tool places data items into the .dat file at start offsets that are
 multiples of 16 bytes.

 When using `genccode` to directly write a .o/.obj file, or to write assembler
 code, it specifies at least 16-alignment. When using `genccode` to write C code,
 it prepends the data with a double value which should yield at least 8-alignment
 on most platforms (usually `sizeof(double)=8`).
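
 The two alignment notions used in this section, a 16-aligned start address and
 start offsets that are multiples of 16 bytes, can be expressed in a few lines
 of C. This is a sketch with hypothetical helper names (`is_16_aligned`,
 `round_up_to_16`); it is not taken from ICU's tools.

 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>

 /* Returns nonzero if the address is a multiple of 16 bytes ("16-aligned"). */
 int is_16_aligned(const void *p)
 {
     return ((uintptr_t)p % 16u) == 0;
 }

 /*
  * Rounds a byte offset up to the next multiple of 16, the kind of start
  * offset described above for data items inside a .dat file.
  */
 size_t round_up_to_16(size_t offset)
 {
     assert(offset <= SIZE_MAX - 15u);   /* guard against wrap-around */
     return (offset + 15u) & ~(size_t)15u;
 }
 ```

 An application that loads a .dat image into its own buffer could use a check
 like `is_16_aligned()` on the buffer address before handing the image to ICU.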

 ## Flexibility vs. Installation vs. Performance

 There are choices that affect ICU data loading and depend on application
 requirements.

 ### Data in Shared Libraries/DLLs vs. .dat package files

 Building ICU data into shared libraries (`--with-data-packaging=library`) is the
 most convenient packaging method because shared libraries (DLLs) are easily
 found if they are in the same directory as the application libraries, or if they
 are on the system library path. The application installer usually just copies
 the ICU shared libraries into the same place. On the other hand, shared libraries
 are not portable.

 Packaging data into .dat files (`--with-data-packaging=archive`) allows them to
 be shared across platforms, but they must either be loaded by the application
 and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
 in a known location that is included in the ICU data directory string. This
 requires the application installer, or the application itself at runtime, to
 locate the ICU and/or application data by setting the ICU data directory (see
 the [ICU Data Directory](icudata#icu-data-directory) section above) or by
 loading the data and providing it to one of the `udata_setXYZData()` functions.

 Unlike shared libraries, .dat package files can be taken apart into separate
 data item files with the `decmn` ICU tool. This allows post-installation
 modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
 used to reassemble the .dat package file.

 For more information about .dat package files see the section [Sharing ICU Data
 Between Platforms](icudata#sharing-icu-data-between-platforms) below.

 ### Data Overriding vs. Loading Performance

 If the ICU data directory string is empty, then ICU will not attempt to load
 data from the file system. It is then only possible to load data from the
 linked-in shared library or via `udata_setCommonData()` and
 `udata_setAppData()`. This is inflexible but provides the highest performance.

 If the ICU data directory string is not empty, then data items are searched in
 all directories and matching .dat files mentioned before checking in
 already-loaded package files. This allows overriding of packaged data items with
 single files after installation but costs some time for filesystem accesses.
 This is usually done only once per data item; see
 [User Data Caching](icudata#user-data-caching) below.

 ### Single Data Files vs. Packages

 Single data files (`--with-data-packaging=files`) are easy to replace and can
 override items inside data packages. However, it is usually desirable to reduce
 the number of files during installation, and package files use less disk space
 than many small files.

 ## How Data Loading Works

 ICU data items are referenced by three names - a path, a name and a type. The
 following are some examples:

 path                         |   name   | type
 -----------------------------|----------|-------
  c:\\some\\path\\dataLibName | test     | dat
  no path                     | cnvalias | icu
  no path                     | cp1252   | cnv
  no path                     | en       | res
  no path                     | uprops   | icu


 Items with 'no path' specified are loaded from the default ICU data.

 Application data items include a path, and will be loaded from user data files,
 not from the ICU default data. For application data, the path argument need not
 contain an actual directory, but must contain the application data's package
 name after the last directory separator character (or by itself if there is no
 directory). If the path argument contains a directory, then it is logically
 prepended to the ICU data directory string and searched first for data. The path
 argument can contain at most one directory. (Path separators like semicolon (;)
 are not handled here.)

 > :point_right: **Note**: The ICU data directory string itself may
 contain multiple directories and path/filenames to .dat package files. See the
 [ICU Data Directory](icudata#icu-data-directory) section.

 It is recommended to not include the directory in the path argument but to make
 sure via setting the application data or the ICU data directory string that the
 data can be located. This simplifies program maintenance and improves
 robustness.

 See the API descriptions for the functions `udata_open()` and
 `udata_openChoice()` for additional information on opening ICU data from within
 an application.

 Data items can exist as individual files, or a number of them can be packaged
 together in a single file for greater efficiency in loading and convenience of
 distribution. The combined files are called Common Files.

 Based on the supplied path and name, ICU searches several possible locations
 when opening data. To make things more concrete in the following descriptions,
 the following values of path, name and type are used:

 ```
 path = "c:\\some\\path\\dataLibName"
 name = "test"
 type = "res"
 ```

 In this case, "dataLibName" is the "package name" part of the path argument, and
 "c:\\some\\path\\" is the directory part of it.

 The search sequence for the data for "test.res" is as follows (the first
 successful loading attempt wins):

 1.  Try to load the file "dataLibName_test.res" from c:\\some\\path\\.

 2.  Try to load the file "dataLibName_test.res" from each of the directories in
     the ICU data directory string.

 3.  Try to locate the data package for the package name "dataLibName".

     1.  Try to locate the data package in the internal cache.

     2.  Try to load the package file "dataLibName.dat" from c:\\some\\path\\.

     3.  Try to load the package file "dataLibName.dat" from each of the
         directories in the ICU data directory string.

 The first steps, loading the data item from an individual file, are omitted if
 no directory is specified in either the path argument or the ICU data directory
 string.
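
 The file names that appear in the search sequence follow a simple pattern:
 `<package>_<name>.<type>` for an individual item file and `<package>.dat` for a
 package file. The sketch below builds both names. The function names are
 hypothetical and the code only mirrors the naming shown in the steps above; it
 does not reproduce ICU's loader.

 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <string.h>

 /*
  * Builds "<package>_<name>.<type>", e.g. "dataLibName_test.res".
  * Returns 1 on success, 0 if the buffer is too small.
  */
 int item_file_name(const char *package, const char *name, const char *type,
                    char *out, size_t out_size)
 {
     assert(package != NULL && name != NULL && type != NULL && out != NULL);
     /* package + '_' + name + '.' + type + terminating NUL */
     size_t needed = strlen(package) + 1 + strlen(name) + 1 + strlen(type) + 1;
     if (needed > out_size) {
         return 0;
     }
     snprintf(out, out_size, "%s_%s.%s", package, name, type);
     return 1;
 }

 /*
  * Builds "<package>.dat", e.g. "dataLibName.dat".
  * Returns 1 on success, 0 if the buffer is too small.
  */
 int package_file_name(const char *package, char *out, size_t out_size)
 {
     assert(package != NULL && out != NULL);
     size_t needed = strlen(package) + strlen(".dat") + 1;
     if (needed > out_size) {
         return 0;
     }
     snprintf(out, out_size, "%s.dat", package);
     return 1;
 }
 ```

 With the example values, the first function yields "dataLibName_test.res" and
 the second yields "dataLibName.dat", the two file names tried in the sequence.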

 Package files are loaded at most once and then cached. They are identified only
 by their package name. Whenever a data item is requested from a package and that
 package has been loaded before, then the cached package is used immediately
 instead of searching through the filesystem.

 > :point_right: **Note**: ICU versions before 2.2 always searched data packages
 before looking for individual files, which made it impossible to override
 packaged data items. See the ICU 2.2 download page and the readme for more
 information about the changes.

 ## User Data Caching

 Once loaded, data package files are cached, and stay loaded for the duration of
 the process. Any requests for data items from an already loaded data package
 file are routed directly to the cached data. No additional search for loadable
 files is made.

 The user data cache is keyed by the base file name portion of the requested
 path, with any directory portion stripped off and ignored. Using the previous
 example, for the path name "c:\\some\\path\\dataLibName", the cache key is
 "dataLibName". After this is cached, a subsequent request for "dataLibName", no
 matter what directory path is specified, will resolve to the cached data.
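
 The cache key derivation described here amounts to taking the base name of the
 path. A minimal sketch follows; `package_cache_key` is a hypothetical name, and
 accepting both '/' and '\\' as separators on every platform is an assumption
 made to keep the example portable.

 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <string.h>

 /*
  * Returns a pointer to the part of path after the last '/' or '\\',
  * or path itself when it contains no directory separator.
  */
 const char *package_cache_key(const char *path)
 {
     assert(path != NULL);
     const char *slash = strrchr(path, '/');
     const char *backslash = strrchr(path, '\\');
     const char *last = slash;
     if (backslash != NULL && (last == NULL || backslash > last)) {
         last = backslash;              /* the later of the two separators */
     }
     return (last != NULL) ? last + 1 : path;
 }
 ```

 Two requests whose paths differ only in the directory part map to the same
 key, which is why the second request resolves to the already cached package.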

 Data can be explicitly added to the cache of common format data by means of the
 `udata_setAppData()` function. This function takes as input the path (name) and
 a pointer to a memory image of a .dat file. The data is added to the cache,
 causing any subsequent requests for data items from that file name to be routed
 to the cache.

 Only data package files are cached. Separate data files that contain just a
 single data item are not cached; for these, multiple requests to ICU to open the
 data will result in multiple requests to the operating system to open the
 underlying file.

 However, most ICU services (Resource Bundles, conversion, etc.) themselves cache
 loaded data, so that data is usually loaded only once until the end of the
 process (or until `u_cleanup()`, `ucnv_flushCache()`, or a similar function is
 called).

 There is no mechanism for removing or updating cached data files.

 ## Directory Separator Characters

 If a directory separator (generally '/' or '\\') is needed in a path parameter,
 use the form that is native to the platform. The ICU header `"putil.h"` defines
 `U_FILE_SEP_CHAR` appropriately for the platform.

 > :point_right: **Note**: On Windows, the directory separator must be '\\' for
 any paths passed to ICU APIs. This is different from native Windows APIs, which
 generally allow either '/' or '\\'.
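
 An application that builds paths with forward slashes could normalize them
 before passing them to ICU on Windows. The helper below is a hypothetical
 sketch (`to_backslashes` is not an ICU function) and is meant for Windows-style
 paths only, since a backslash can be an ordinary file name character elsewhere.

 ```c
 #include <assert.h>
 #include <stddef.h>

 /*
  * Replaces every '/' in path with '\\', in place.
  * Returns the number of characters that were replaced.
  */
 size_t to_backslashes(char *path)
 {
     assert(path != NULL);
     size_t replaced = 0;
     for (char *p = path; *p != '\0'; ++p) {
         if (*p == '/') {
             *p = '\\';
             ++replaced;
         }
     }
     return replaced;
 }
 ```

 Code that includes ICU's headers can compare against `U_FILE_SEP_CHAR` instead
 of hard-coding the separator.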

 ## Sharing ICU Data Between Platforms

 ICU's default data is (at the time of this writing) about 8 MB in size. Because
 it is normally built as a shared library, the file format is specific to each
 platform (operating system). The data libraries cannot be shared between
 platforms even though the actual data contents are identical.

 By distributing the default data in the form of common format .dat files rather
 than as shared libraries, a single data file can be shared among multiple
 platforms. This is beneficial if a single distribution of the application (a CD,
 for example) includes binaries for many platforms, and the size requirements for
 replicating the ICU data for each platform are a problem.

 ICU common format data files are not completely interchangeable between
 platforms. The format depends on these properties of the platform:

 1.  Byte Ordering (little endian vs. big endian)

 2.  Base character set - ASCII or EBCDIC

 This means, for example, that ICU data files are interchangeable between Windows
 and Linux on X86 (both are ASCII little endian), or between Macintosh and
 Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC
 and Solaris on X86 (different byte ordering).

 The single letter following the version number in the file name of the default
 ICU data file encodes the properties of the file as follows:

 ```
 icudt19l.dat Little Endian, ASCII
 icudt19b.dat Big Endian, ASCII
 icudt19e.dat Big Endian, EBCDIC
 ```

 (There are no little endian EBCDIC systems. All non-EBCDIC encodings include an
 invariant subset of ASCII that is sufficient to enable these files to
 interoperate.)
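
 The mapping from platform properties to the file name letter can be sketched as
 follows. The function names are hypothetical, the byte-order probe is a common
 C idiom rather than ICU's own detection code, and the version number is printed
 as given without padding, which is an assumption.

 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <stdio.h>

 /*
  * Maps platform properties to the letter used in the data file name.
  * Returns '\0' for little endian EBCDIC, which the text above says
  * does not exist.
  */
 char icu_data_flag(int big_endian, int ebcdic)
 {
     if (ebcdic) {
         return big_endian ? 'e' : '\0';
     }
     return big_endian ? 'b' : 'l';
 }

 /* Detects the byte order of the machine running this code. */
 int host_is_big_endian(void)
 {
     const unsigned int probe = 1u;
     /* On a little endian machine the lowest-addressed byte holds the 1. */
     return *(const unsigned char *)&probe == 0;
 }

 /*
  * Writes "icudt<version><flag>.dat" into out, e.g. "icudt19l.dat".
  * Returns 1 on success, 0 if the buffer is too small.
  */
 int default_data_file_name(int version, char flag, char *out, size_t out_size)
 {
     assert(out != NULL && version >= 0);
     assert(flag == 'l' || flag == 'b' || flag == 'e');
     int n = snprintf(out, out_size, "icudt%d%c.dat", version, flag);
     return n >= 0 && (size_t)n < out_size;
 }
 ```

 Calling `icu_data_flag(host_is_big_endian(), 0)` gives the letter for an ASCII
 platform with the host's byte order.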

 The packaging of the default ICU data as a .dat file rather than as a shared
 library is requested by using an option in the configure script at build time.
 Nothing is required at run time; ICU finds and uses whatever form of the data is
 available.

 > :point_right: **Note**: When the ICU data is built in the form of shared
 libraries, the library names have platform-specific prefixes and suffixes. On
 Unix-style platforms, all the libraries have the "lib" prefix and one of the
 usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and
 suffixes, the library names are the same as the above .dat files.

 ## Customizing ICU's Data Library

 ICU includes a standard library of data that is about 16 MB in size. Most of
 this consists of conversion tables and locale information. The data itself is
 normally placed into a single shared library.

 Update: as of ICU 64, the standard data library is over 20 MB in size. We have
 introduced a new tool, the [ICU Data Build Tool](icu_data/buildtool.md),
 to replace the makefiles explained below and give you more control over what
 goes into your ICU locale data file.

 ### Adding Converters to ICU

 The first step is to obtain or create a .ucm (source) mapping data file for the
 desired converter. A large archive of converter data is maintained by the ICU
 team at <https://github.com/unicode-org/icu-data/tree/master/charset/data/ucm>.

 We will use `solaris-eucJP-2.7.ucm`, available from the repository mentioned
 above, as an example.

 #### Build the Converter

 Converter source files are compiled into binary converter files (.cnv files) by
 using the ICU tool `makeconv`. For the example, you can use this command:

 ```
 makeconv -v solaris-eucJP-2.7.ucm
 ```

 Some of the .ucm files from the repository will need additional header
 information before they can be built. Use the error messages from the makeconv
 tool, .ucm files for similar converters, and the ICU user guide documentation of
 .ucm files as a guide when making changes. For the `solaris-eucJP-2.7.ucm`
 example, we will borrow the missing header fields from
 `source/data/mappings/ibm-33722_P12A-2000.ucm`, which is the standard ICU eucJP
 converter data.

 The ucm file format is described in the
 ["Conversion Data" chapter](conversion/data.md) of this user guide.

 After adjustment, the header of the `solaris-eucJP-2.7.ucm` file contains these
 items:

 ```
 <code_set_name>   "solaris-eucJP-2.7"
 <subchar>         \x3F
 <uconv_class>     "MBCS"

 <mb_cur_max>      3
 <mb_cur_min>      1

 <icu:state>       0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
 <icu:state>       a1-fe
 <icu:state>       a1-e4
 <icu:state>       a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
 <icu:state>       a1-fe
 ```

 The binary converter file produced by the `makeconv` tool is
 `solaris-eucJP-2.7.cnv`.

 #### Installation

 Copy the new .cnv file to the desired location for use. Set the environment
 variable `ICU_DATA` to the directory containing the data, or, alternatively,
 from within an application, tell ICU the location of the new data with the
 function `u_setDataDirectory()` before using the new converter.

 If ICU is already obtaining data from files rather than a shared library,
 install the new file in the same location as the existing ICU data file(s), and
 don't change/set the environment variable or data directory.

 If you do not want to add a converter to ICU's base data, you can also generate
 a conversion table with `makeconv`, use pkgdata to generate your own package and
 use the `ucnv_openPackage()` to open up a converter with that conversion table
 from the generated package.

 #### Building the new converter into ICU

 The need to install a separate file and inform ICU of the data directory can be
 avoided by building the new converter into ICU's standard data library. Here is
 the procedure for doing so:

 1.  Move the .ucm file(s) for the converter(s) to be added
     (`solaris-eucJP-2.7.ucm` for our example) into the directory
     `source/data/mappings/`.

 2.  Create, or edit, if it already exists, the file
     `source/data/mappings/ucmlocal.mk`. Add this line:

     ```
     UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
     ```

     Any number of converters can be listed. Extend the list to new lines with a
     backslash at the end of the line. The `ucmlocal.mk` file is described in
     more detail in `source/data/mappings/ucmfiles.mk`. (Even though they use very
     different build systems, `ucmlocal.mk` is used for both the Windows and UNIX
     builds.)

 3.  Add the converter name and aliases to `source/data/mappings/convrtrs.txt`.
     This will allow your converter to be shown in the list of available
     converters when you call the `ucnv_getAvailableName()` function. The file
     syntax is described within the file.

 4.  Rebuild the ICU data.
     For Windows, from MSVC choose the makedata project from the GUI, then build
     the project.
     For UNIX, `cd icu/source/data; gmake`

 When opening an ICU converter (`ucnv_open()`), the converter name cannot be
 qualified with a path that indicates the directory or common data file
 containing the corresponding converter data. The required data must be present
 either in the main ICU data library or as a separate .cnv file located in the
 ICU data directory. This is different from opening resources or other types of
 ICU data, which do allow a path.

 ### Adding Locale Data to ICU's Data

 If you have data for a locale that is not included in ICU's standard build, then
 you can add it to the build in a very similar way as with conversion tables
 above. The ICU project provides a large number of additional locales in its
 [locale
 repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/)
 on the web. Most of this locale data is derived from the CLDR ([Common Locale
 Data Repository](http://www.unicode.org/cldr/)) project.

 Dropping the txt file into the correct place in the source tree is sufficient to
 add it to your ICU build. You will need to re-configure in order to pick it up.

 ## Customizing ICU's Data Library for ICU 63 or earlier
 The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space.

 > :point_right: **Note**: ICU for C by default comes with pre-built data.
 The source data files are included as an "icu\*data.zip" file starting in
 ICU4C 49. Previously, they were not included unless ICU was downloaded from the
 [source repository](https://github.com/unicode-org/icu). Alternatively, the
 [Data Customizer](http://apps.icu-project.org/datacustom/) may be used to
 customize the pre-built data.

 ICU can load data from individual data files as well as from its default
 library, so building a customized library when adding additional data is not
 strictly necessary. Adding to ICU's library can simplify application
 installation by eliminating the need to include separate files with an
 application distribution, and the need to tell ICU where they are installed.

 Reducing the size of ICU's data by eliminating unneeded resources can make
 sense on small systems with limited or no disk, but for desktop or server
 systems there is no real advantage to trimming. ICU's data is memory mapped
 into an application's address space, and only those portions of the data
 actually being used are ever paged in, so there are no significant RAM savings.
 As for disk space, with the large size of today's hard drives, saving a few MB
 is not worth the bother.

 By default, ICU builds with a large set of converters and with all available
 locales. This means that any extra items added must be provided by the
 application developer. There is no extra ICU-supplied data that could be
 specified.

 ### Details

 The converters and resources that ICU builds are in the following configuration
 files. They are only available when building from ICU's source code repository.
 Normally, the standard ICU distribution does not include these files.

 File                              | Description
 ----------------------------------|--------------
 source/data/locales/resfiles.mk   | The standard set of locale data resource bundles
 source/data/locales/reslocal.mk   | User-provided file with additional resource bundles
 source/data/coll/colfiles.mk      | The standard set of collation data resource bundles
 source/data/coll/collocal.mk      | User-provided file with additional collation resource bundles
 source/data/brkitr/brkfiles.mk    | The standard set of break iterator data resource bundles
 source/data/brkitr/brklocal.mk    | User-provided file with additional break iterator resource bundles
 source/data/translit/trnsfiles.mk | The standard set of transliterator resource files
 source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files
 source/data/mappings/ucmcore.mk   | Core set of conversion tables for MIME/Unix/Windows
 source/data/mappings/ucmfiles.mk  | Additional, large set of conversion tables for a wide range of uses
 source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables
 source/data/mappings/ucmlocal.mk  | User-provided file with additional conversion tables
 source/data/misc/miscfiles.mk     | Miscellaneous data, like timezone information

 These files function identically for both Windows and UNIX builds of ICU. ICU
 will automatically update the list of installed locales returned by
 `uloc_getAvailable()` whenever `resfiles.mk` or `reslocal.mk` are updated and
 the ICU data library is rebuilt. These files are only needed while building ICU.
 If any of these files are removed or renamed, the size of the ICU data library
 will be reduced.

 The optional files `reslocal.mk` and `ucmlocal.mk` are not included as part of
 a standard ICU distribution. Thus these customization files do not need to be
 merged or updated when updating versions of ICU.

 Both `reslocal.mk` and `ucmlocal.mk` are makefile includes, so the usual rules
 for makefiles apply. A line may be continued by ending it with a backslash.
 Lines beginning with a # are comments. See `ucmfiles.mk` and `resfiles.mk` for
 additional information.

 ### Reducing the Size of ICU's Data: Conversion Tables

 The size of the ICU data file in the standard build configuration is about 8 MB.
 The majority of this is used for conversion tables. ICU comes with so many
 conversion tables because many ICU users need to support many encodings from
 many platforms. There are conversion tables for EBCDIC and DOS codepages, for
 ISO 2022 variants, and for small variations of popular encodings.

 > :point_right: **Important**: ICU provides full internationalization
 functionality without **any** conversion table data. The common library
 contains code to handle several important encodings algorithmically: US-ASCII,
 ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e.,
 US-ASCII, ISO-8859-1, and all Unicode charsets; see
 source/data/mappings/convrtrs.txt for the current list).

 Therefore, the easiest way to substantially reduce the size of ICU's data
 (without limiting I18N support) is to reduce the number of conversion tables
 that are built into the data file.

 The conversion tables are listed for the build process in several makefiles,
 `source/data/mappings/ucm*.mk`, roughly grouped by how commonly they are used.
 If you remove or rename any of these files, then the ICU build will exclude the
 conversion tables that are listed in that file. Beginning with ICU 2.0, all of
 these makefiles including the main one are optional. If you remove all of them,
 then ICU will include only very few conversion tables for "fallback" encodings
 (see note below).

 If you remove or rename all `ucm*.mk` files, then ICU's data is reduced to
 about 3.6 MB. If you remove all these files except for `ucmcore.mk`, then ICU's
 data is reduced to about 4.7 MB, while keeping support for a core set of common
 MIME/Unix/Windows encodings.

 > :point_right: **Note**: If you remove the conversion table for an encoding
 that could be a default encoding on one of your platforms, then ICU will not be
 able to instantiate a default converter. In this case, ICU 2.0 and up will
 automatically fall back to a "lowest common denominator" and load a converter
 for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be
 good enough for converting strings that contain only "ASCII" characters (see the
 comment about "invariant characters" in `utypes.h`).
 *When ICU is built with a reduced set of conversion tables, then some tests will
 fail that test the behavior of the converters based on known features of some
 encodings. Also, building the testdata will fail if you remove some conversion
 tables that are necessary for that (to test non-ASCII/Unicode resource bundle
 source files, for example). You can ignore these failures. Build with the
 standard set of conversion tables, if you want to run the tests.*

 ### Reducing the Size of ICU's Data: Locale Data

 If you need to reduce the size of ICU's data even further, then you need to
 remove other files or parts of files from the build as well.

 There are a number of different subdirectories of 'data' containing locale data
 split out by section. Each subdirectory has its own **.mk** file listing the
 locales which will be built. Subdirectories include **lang** for language names
 and **curr** for currency names.

 You can remove data for entire locales by removing their files from
 `source/data/locales/resfiles.mk` or the appropriate other .mk file. ICU will
 then use the data of the parent locale instead, ultimately falling back to
 root.txt. Even if you remove all resource bundles for a given language and its
 country/region/variant sublocales, **do not remove root.txt!** Also, do not
 remove a parent locale if child locales exist. For example, do not remove "en"
 while retaining "en_US".
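
The parent/child rule can be checked mechanically before a rebuild. The following Python sketch is only an illustration: it assumes that a locale's parent is found by dropping the last underscore-separated subtag, and it ignores any explicit parent or alias relationships that the data may declare. The function names and the file list are hypothetical.

```python
def parent_locale(locale_id):
    """Parent by truncation: 'en_US' -> 'en', 'en' -> 'root', 'root' -> None."""
    if locale_id == "root":
        return None
    head, sep, _ = locale_id.rpartition("_")
    return head if sep else "root"


def orphaned_locales(file_names):
    """Return locales whose truncation parent is missing from file_names."""
    suffix = ".txt"
    present = {n[:-len(suffix)] for n in file_names if n.endswith(suffix)}
    return sorted(
        loc for loc in present
        if parent_locale(loc) is not None and parent_locale(loc) not in present
    )


# "en.txt" was removed while "en_US.txt" was kept, so en_US is reported.
print(orphaned_locales(["root.txt", "en_US.txt", "ja.txt", "ja_JP.txt"]))
# prints ['en_US']
```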

 ### Reducing the Size of ICU's Data: Collation Data

 Collation data (for sorting, searching and alphabetic indexes) is also large,
 especially the collation data for East Asian languages because they define
 multiple orderings of tens of thousands of Han characters. You can remove the
 collation data for those languages by removing references to those locales from
 `source/data/coll/colfiles.mk`. When you do that, the collation for those
 languages will fall back to the root collator, that is, you lose
 language-specific behavior.

 A much less radical approach is to keep the collation data tables but remove the
 tailoring rule strings from which they were built. Those rule strings are
 rarely used at runtime. For documentation about their use and how to remove
 them see the section "Building on Existing Locales" in the
 [Collation Customization chapter](collation/customization/index.md).

 ### Adding Locale Data to ICU's Data

 To add a locale, write a resource bundle file for it with a structure like the
 existing locale resource bundles (e.g. `source/data/locales/ja.txt`, `ru_RU.txt`,
 `kok_IN.txt`) and add it to the build by writing a file
 `source/data/locales/reslocal.mk`, as described above. In this file, define the
 list of additional resource bundles as
 ```
 GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
 ```

 Starting in ICU 2.2, these added locales are automatically listed by
 `uloc_getAvailable()`.

 ## ICU Data File Formats

 ICU uses several kinds of data files with specific source (plain text) and
 binary data formats. The following lists provide links to descriptions of those
 formats.

 Each ICU data object begins with a header before the actual, specific data. The
 header consists of a 16-bit header length value, the two "magic" bytes DA 27 and
 a [UDataInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/structUDataInfo.html#_details)
 structure which specifies the data object's endianness, charset family, format,
 data version, etc.

 (This is not the case for the trie structures, which are not stand-alone,
 loadable data objects.)
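
The header layout described above can be explored with a short Python sketch. The field offsets below are an assumption based on the `UDataInfo` declaration in ICU4C's `unicode/udata.h` (size, reserved word, endianness flag, charset family, `UChar` size, reserved byte, then three 4-byte arrays); check them against the linked API reference before relying on them. The function name `parse_icu_header` and the sample bytes are made up for illustration.

```python
import struct


def parse_icu_header(data):
    """Decode the fixed-size start of an ICU data object header."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    # Bytes 2 and 3 hold the "magic" bytes DA 27.
    if data[2] != 0xDA or data[3] != 0x27:
        raise ValueError("magic bytes DA 27 not found")
    # UDataInfo is assumed to start at offset 4; its endianness flag is a
    # single byte, so it can be read before any multi-byte field.
    is_big_endian = bool(data[8])
    order = ">" if is_big_endian else "<"
    (header_size,) = struct.unpack_from(order + "H", data, 0)
    (info_size,) = struct.unpack_from(order + "H", data, 4)
    return {
        "header_size": header_size,
        "info_size": info_size,
        "is_big_endian": is_big_endian,
        "charset_family": data[9],
        "sizeof_uchar": data[10],
        "data_format": bytes(data[12:16]),
        "format_version": tuple(data[16:20]),
        "data_version": tuple(data[20:24]),
    }


# Synthetic little-endian header; "Demo" is not a real ICU data format.
sample = struct.pack(
    "<HBBHHBBBB4s4B4B",
    32, 0xDA, 0x27,        # header length, magic bytes
    20, 0,                 # UDataInfo size, reserved word
    0, 0, 2, 0,            # little-endian, charset family, UChar size, reserved
    b"Demo",               # data format
    1, 0, 0, 0,            # format version
    64, 0, 0, 0,           # data version
)
info = parse_icu_header(sample)
print(info["header_size"], info["data_format"], info["data_version"])
```

Reading the one-byte endianness flag first is what allows the 16-bit header length to be decoded correctly regardless of the platform the file was built for.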

 ### Public Data Files

 #### ICU.dat package files
 *   Source format: (list of files provided as input to the icupkg tool, or
                    on the gencmn tool command line)
 *   Binary format: .dat: [source/tools/toolutil/pkg_gencmn.cpp](../../icu4c/source/tools/toolutil/pkg_gencmn.cpp)
 *   Generator tool: [icupkg](../../icu4c/source/tools/icupkg) or
                    [gencmn](../../icu4c/source/tools/gencmn)

 #### Resource bundles
 *   Source format: .txt: [icuhtml/design/bnf_rb.txt](https://github.com/unicode-org/icu-docs/blob/master/design/bnf_rb.txt)
 *   Binary format: .res: [source/common/uresdata.h](../../icu4c/source/common/uresdata.h)
 *   Generator tool: [genrb](../../icu4c/source/tools/genrb)

 #### Unicode conversion mapping tables
 *   Source format: .ucm: [Conversion Data chapter](conversion/data.md)
 *   Binary format: .cnv: [source/common/ucnvmbcs.h](../../icu4c/source/common/ucnvmbcs.h)
 *   Generator tool: [makeconv](../../icu4c/source/tools/makeconv)

 #### Conversion (charset) aliases
 *   Source format: [source/data/mappings/convrtrs.txt](../../icu4c/source/data/mappings/convrtrs.txt):
                    contains format description. The command "uconv -l --canon"
                    will also generate the alias table from the currently used
                    copy of ICU.
 *   Binary format: cnvalias.icu: [source/common/ucnv_io.cpp](../../icu4c/source/common/ucnv_io.cpp)
 *   Generator tool: [gencnval](../../icu4c/source/tools/gencnval)

 #### Unicode Character Data (Properties; for Java only: hardcoded in C common library)
 *   Source format: [source/data/unidata/ppucd.txt](../../icu4c/source/data/unidata/ppucd.txt):
                    [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
 *   Binary format: uprops.icu: [tools/unicode/c/genprops/corepropsbuilder.cpp](../../tools/unicode/c/genprops/corepropsbuilder.cpp)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
 *   Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata):
                    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: ucase.icu: [tools/unicode/c/genprops/casepropsbuilder.cpp](../../tools/unicode/c/genprops/casepropsbuilder.cpp)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
 *   Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata):
                    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: ubidi.icu: [tools/unicode/c/genprops/bidipropsbuilder.cpp](../../tools/unicode/c/genprops/bidipropsbuilder.cpp)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
 *   Source format: [source/data/unidata/norm2/*.txt](../../icu4c/source/data/unidata/norm2):
                    Files derived from the [Unicode Character Database](http://www.unicode.org/onlinedat/online.html),
                    or custom data.
 *   Binary format: .nrm: [source/common/normalizer2impl.h](../../icu4c/source/common/normalizer2impl.h)
 *   Generator tool: [gennorm2](../../icu4c/source/tools/gennorm2)

 #### Unicode Character Data (Character names)
 *   Source format: [source/data/unidata/UnicodeData.txt](../../icu4c/source/data/unidata/UnicodeData.txt):
                    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: unames.icu: [tools/unicode/c/genprops/namespropsbuilder.cpp](../../tools/unicode/c/genprops/namespropsbuilder.cpp)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
 *   Source format: [UCD Property*Aliases.txt](http://www.unicode.org/Public/UNIDATA/):
                    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: pnames.icu: [source/common/propname.h](../../icu4c/source/common/propname.h)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Unicode Character Data (Text layout properties since ICU 64)
 *   Source format: [source/data/unidata/ppucd.txt](../../icu4c/source/data/unidata/ppucd.txt):
                    [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
 *   Binary format: ulayout.icu: [tools/unicode/c/genprops/layoutpropsbuilder.cpp](../../tools/unicode/c/genprops/layoutpropsbuilder.cpp)
 *   Generator tool: [genprops](../../tools/unicode/c/genprops)

 #### Collation data (root collation & tailorings; ICU 53 & later)
 *   Source format: Original data from allkeys_CLDR.txt in [CLDR Root Collation Data Files](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files)
                    processed into [source/data/unidata/FractionalUCA.txt](../../icu4c/source/data/unidata/FractionalUCA.txt) by
                    [tool at unicode.org maintained by Mark Davis](https://sites.google.com/site/unicodetools/#TOC-UCA)
                    (call the Main class with option writeFractionalUCA);
                    source tailorings (text rules) in [source/data/coll/*.txt](../../icu4c/source/data/coll) resource bundles:
                    [Collation Customization chapter](collation/customization/index.md).
 *   Binary format: ucadata.icu & binary tailorings in resource bundles:
                    [source/i18n/collationdatareader.h](../../icu4c/source/i18n/collationdatareader.h)
 *   Generator tool: [genuca](../../tools/unicode/c/genuca), [genrb](../../icu4c/source/tools/genrb)

 #### Rule-based break iterator data
 *   Source format: .txt: [Boundary Analysis chapter](boundaryanalysis/index.md)
 *   Binary format: .brk: [source/common/rbbidata.h](../../icu4c/source/common/rbbidata.h)
 *   Generator tool: [genbrk](../../icu4c/source/tools/genbrk)

 #### Dictionary-based break iterator data (ICU 50 & later)
 *   Source format: .txt: [gendict.cpp comments](../../icu4c/source/tools/gendict/gendict.cpp)
 *   Binary format: .dict: see [source/common/dictionarydata.h](../../icu4c/source/common/dictionarydata.h)
 *   Generator tool: [gendict](../../icu4c/source/tools/gendict)

 #### Rule-based transform (transliterator) data
 *   Source format: .txt (in resource bundles): [Transform Rule Tutorial chapter](transforms/general/rules.md)
 *   Binary format: Uses genrb to make binary format
 *   Generator tool: Does not apply

 #### Time zone data (ICU 4.4 & later)
 *   Source format: [source/data/misc/zoneinfo64.txt](../../icu4c/source/data/misc/zoneinfo64.txt):
     ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
 *   Binary format: zoneinfo64.res (generated by genrb and [tzcode tools](../../icu4c/source/tools/tzcode/readme.txt)).
 *   Generator tool: Does not apply

 #### StringPrep profile data
 *   Source format: [source/data/sprep/rfc3491.txt](../../icu4c/source/data/sprep/rfc3491.txt)
 *   Binary format: .spp: [source/tools/gensprep/store.c](../../icu4c/source/tools/gensprep/store.c)
 *   Generator tool: [gensprep](../../icu4c/source/tools/gensprep)

 #### Confusables data
 *   Source format: [source/data/unidata/confusables.txt](../../icu4c/source/data/unidata/confusables.txt),
                    [source/data/unidata/confusablesWholeScript.txt](../../icu4c/source/data/unidata/confusablesWholeScript.txt)
 *   Binary format: confusables.cfu: [source/i18n/uspoof_impl.h](../../icu4c/source/i18n/uspoof_impl.h)
 *   Generator tool: [gencfu](../../icu4c/source/tools/gencfu)

 ### Public Data Files (old versions)

 #### Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
 *   Source format: [source/data/unidata/*.txt](../../icu4c/source/data/unidata):
                    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: unorm.icu: [source/common/unormimp.h](../../icu4c/source/common/unormimp.h)
 *   Generator tool: gennorm

 #### Unicode Character Data (Property [value] aliases before ICU 4.8)
 *   Source format: source/data/unidata/Property*Aliases.txt: [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
 *   Binary format: pnames.icu: source/common/propname.h (ICU 4.6)
 *   Generator tool: genpname

 #### Collation data (UCA, code points to weights; ICU 52 & earlier)
 *   Source format: Same as in ICU 53
 *   Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52)
 *   Generator tool: [genuca](../../tools/unicode/c/genuca), [genrb](../../icu4c/source/tools/genrb)

 #### Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
 *   Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu
 *   Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52)
 *   Generator tool: [genuca](../../tools/unicode/c/genuca)

 #### Dictionary-based break iterator data (ICU 49 & earlier)
 *   Source format: .txt: genctd.cpp comments
 *   Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
 *   Generator tool: genctd

 #### Time zone data (Before ICU 4.4)
 *   Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
 *   Binary format: zoneinfo64.res (generated by genrb and [tzcode tools](../../icu4c/source/tools/tzcode/readme.txt)).
 *   Generator tool: Does not apply

 ### Non-File API Binary Data

 #### Converter selector data
 *   Source format: none
 *   Binary format: [source/common/ucnvsel.cpp](../../icu4c/source/common/ucnvsel.cpp)
 *   Generator tool: [ucnvsel_open()](../../icu4c/source/common/ucnvsel.cpp)

 ### Test-Only Data Files

 #### test.icu (for udata API testing)
 *   Source format: none (fixed output from gentest when not using -r or -j options)
 *   Binary format: test.icu: see `createData()`
                    in [source/tools/gentest/gentest.c](../../icu4c/source/tools/gentest/gentest.c)
 *   Generator tool: [gentest](../../icu4c/source/tools/gentest/gentest.c)

 ### Other Data Structures

 #### UCPTrie (C)/CodePointTrie (Java) (maps code points to integers)
 *   Source format: (public builder API)
 *   Binary format: [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
                    [icu4c/source/common/ucptrie_impl.h](../../icu4c/source/common/ucptrie_impl.h)
 *   Generator tool: (builder class)

 #### UTrie2 (C)/Trie2 (Java) (maps code points to integers)
 *   Source format: (internal builder API)
 *   Binary format: [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
                    [icu4c/source/common/utrie2_impl.h](../../icu4c/source/common/utrie2_impl.h)
 *   Generator tool: (builder class)

 #### BytesTrie (maps byte sequences to 32-bit integers)
 *   Source format: (public builder API)
 *   Binary format: [BytesTrie design doc](http://site.icu-project.org/design/struct/tries/bytestrie),
                    [icu4c/source/common/unicode/bytestrie.h](../../icu4c/source/common/unicode/bytestrie.h)
 *   Generator tool: (builder class)

 #### UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers)
 *   Source format: (public builder API)
 *   Binary format: [UCharsTrie design doc](http://site.icu-project.org/design/struct/tries/ucharstrie),
                    [icu4c/source/common/unicode/ucharstrie.h](../../icu4c/source/common/unicode/ucharstrie.h)
 *   Generator tool: (builder class)

 ## ICU4J Resource Information

 Starting with release 2.1, ICU4J includes its own resource information which is
 completely independent of the JRE resource information. (Note: in ICU4J 2.8 to
 3.4, time zone information depends on the underlying JRE.) The new ICU4J
 information is equivalent to the information in ICU4C and many resources are,
 in fact, the same binary files that ICU4C uses.

 By default the ICU4J distribution includes all of the standard resource
 information. It is located under the directory `com/ibm/icu/impl/data`.
 Depending on the service, the data is in different locations and in different
 formats. Note: This will continue to change from release to release, so clients
 should not depend on the exact organization of the data in ICU4J.

 1.  The primary **locale data** is under the directory `icudt38b`, as a set of
     ".res" files whose names are the locale identifiers. Locale naming is
     documented in the `com.ibm.icu.util.ULocale` class, and the use of these
     names in searching for resources is documented in
     `com.ibm.icu.util.UResourceBundle`.

 2.  The **collation data** is under the directory `icudt38b/coll`, as a set of
     ".res" files.

 3.  The **rule-based transliterator data** is under the directory
     `icudt38b/translit` as a set of ".res" files. (**Note:** the Han
     transliterator test data is no longer included in the core icu4j.jar file by
     default.)

 4.  The **rule-based number format data** is under the directory `icudt38b/rbnf`
     as a set of ".res" files.

 5.  The **break iterator data** is directly under the data directory, as a set
     of ".brk" files, named according to the type of break and the locale where
     there are locale-specific versions.

 6.  The **holiday data** is under the data directory, as a set of ".class"
     files, named "HolidayBundle_" followed by the locale ID.

 7.  The **character property data** as well as assorted **normalization data**
     and default **Unicode Collation Algorithm (UCA) data** is found under the
     data directory as a set of ".icu" files.

 8.  The **character set converter data** is under the directory `icudt38b/`, as
     a set of ".cnv" files. These files are currently included only in
     icu-charset.jar.

 9.  The **time zone data** is named `zoneinfo.res` under the directory
     `icudt38b`.

 Some of the data files alias or otherwise reference data from other data files.
 One reason for this is that some locale names have changed. For example,
 he_IL used to be iw_IL. In order to support both names but not duplicate the
 data, one of the resource files refers to the other file's data. In other cases,
 a file may alias a portion of another file's data in order to save space.
 Currently ICU4J provides no tool for revealing these dependencies.

 > :point_right: **Note**: Java's Locale class silently converts the language
 code "he" to "iw" when you construct the Locale (for versions of Java through
 Java 5). Thus Java cannot be used to locate resources that use the "he" language
 code. ICU, on the other hand, does not perform this conversion in ULocale, and
 instead uses aliasing in the locale data to represent the same set of data under
 different locale ids.

 Resource files that use locale ids form a hierarchy, with up to four levels: a
 root, language, region (country), and variant. Searches for locale data attempt
 to match as far down the hierarchy as possible. For example, "he_IL" will match
 he_IL, but "he_US" will match he (since there is no US variant for he), and
 "xx_YY" will match root (the default fallback locale), since there is no xx
 language code in the locale hierarchy. Again, see `java.util.ResourceBundle` for
 more information.
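
The matching behavior in these examples can be sketched in a few lines of Python. This is a simplified model, not ICU4J's lookup code: it assumes fallback happens purely by dropping the last underscore-separated subtag and ends at root, and it ignores aliases and any other steps a real lookup may take. The function names and the set of available locales are hypothetical.

```python
def fallback_chain(locale_id):
    """'he_IL' -> ['he_IL', 'he', 'root']."""
    chain = []
    current = locale_id
    while current:
        chain.append(current)
        # rpartition returns '' as the head once no underscore is left.
        current = current.rpartition("_")[0]
    if not chain or chain[-1] != "root":
        chain.append("root")
    return chain


def resolve(locale_id, available):
    """Return the most specific available locale for locale_id, if any."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    return None


available = {"root", "he", "he_IL"}
print(resolve("he_IL", available))  # prints he_IL
print(resolve("he_US", available))  # prints he
print(resolve("xx_YY", available))  # prints root
```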

 Currently ICU4J provides no tool for revealing these dependencies between data
 files, so trimming the data directly in the ICU4J project is a hit-or-miss
 affair. The key point when you remove data is to make sure to remove all
 dependencies on that data as well. For example, if you remove he.res, you need
 to remove he_IL.res, since it is lower in the hierarchy, and you must remove
 iw.res, since it references he.res, and iw_IL.res, since it depends on it (and
 also references he_IL.res).
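
Since no tool reveals these dependencies, one option is to record them by hand and compute the full removal set from that record. The Python sketch below assumes you have written down the alias relationships yourself (the `aliases` mapping here simply restates the he/iw example from the text) and treats any locale whose id starts with the removed id plus an underscore as a child. The function name `removal_closure` is hypothetical.

```python
def removal_closure(to_remove, all_locales, aliases):
    """Return every locale that must go if the locales in to_remove go.

    aliases maps a locale id to the locale id whose data it references.
    """
    removed = set()
    work = list(to_remove)
    while work:
        loc = work.pop()
        if loc in removed or loc not in all_locales:
            continue
        removed.add(loc)
        for other in all_locales:
            if other in removed:
                continue
            is_child = other.startswith(loc + "_")
            references_it = aliases.get(other) == loc
            if is_child or references_it:
                work.append(other)
    return sorted(removed)


locales = {"root", "en", "he", "he_IL", "iw", "iw_IL"}
aliases = {"iw": "he", "iw_IL": "he_IL"}  # hand-maintained, per the example
print(removal_closure(["he"], locales, aliases))
# prints ['he', 'he_IL', 'iw', 'iw_IL']
```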

 Unfortunately, the jar tool in the JDK provides no way to remove items from a
 jar file. Thus you have to extract the resources, remove the ones you don't
 want, and then create a new jar file with the remaining resources. See the jar
 tool information for how to do this. Before re-jarring the files, be sure to
 thoroughly test your application with the remaining resources, making sure each
 required resource is present.
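
As a rough illustration of the extract, filter, and repackage steps, the Python sketch below copies a jar to a new file while skipping unwanted entries. It relies on the fact that a jar file is a zip archive and uses only Python's standard `zipfile` module; it does not preserve entry timestamps and makes no attempt to handle signed jars. The function name `copy_jar_without` and the entry names are hypothetical.

```python
import io
import zipfile


def copy_jar_without(src, dst, unwanted):
    """Copy the entries of jar src to dst, skipping names in unwanted.

    src and dst may be file paths or binary file-like objects.
    Returns the list of entry names that were kept, in archive order.
    """
    kept = []
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for entry in zin.infolist():
            if entry.filename in unwanted:
                continue
            zout.writestr(entry.filename, zin.read(entry.filename))
            kept.append(entry.filename)
    return kept


# Build a tiny in-memory "jar" with made-up entries, then filter it.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as jar:
    jar.writestr("data/root.res", b"root")
    jar.writestr("data/he.res", b"he")
    jar.writestr("data/he_IL.res", b"he_IL")
original.seek(0)

trimmed = io.BytesIO()
kept_entries = copy_jar_without(
    original, trimmed, {"data/he.res", "data/he_IL.res"})
print(kept_entries)  # prints ['data/root.res']
```

Whatever tool you use for this step, test the application against the trimmed jar afterwards, as described above.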

 #### Using additional resource files with ICU4J

 > :point_right: **Note**: Resource file formats can change across releases of ICU4J!
 >
 > *The format of ICU4J resources is not part of the API. Clients who develop their
 > own resources for use with ICU4J should be prepared to regenerate them when they
 > move to new releases of ICU4J.*

 We are still developing ICU4J's resource mechanism. Currently it is not possible
 to mix ICU's new binary .res resources with traditional Java-style .class or
 .txt resources. We might allow for this in a future release, but since the
 resource data and format is not formally supported, you run the risk of
 incompatibilities with future releases of ICU4J.

 Resource data in ICU4J is checked in to the repository as a jar file containing
 the resource binaries, icudata.jar. This means that inspecting the contents of
 these resources is difficult. They currently are compiled from ICU4C .txt file
 data. You can view the contents of the ICU4C text resource files to understand
 the contents of the ICU4J resources.

 The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build
 directory when the 'core' target is built. Building the 'resources' target will
 force the resources to once again be extracted. Extraction will overwrite any
 corresponding resource files already in that directory.

 ### Building ICU4J Resources from ICU4C

 #### Requirements

 1.  [ICU4C](http://icu-project.org/download/)

 2.  Compilers and tools required for [building ICU4C](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).

 3.  J2SE SDK version 5 or above

 #### Procedure

 1.  Download and build ICU4C on a Windows or Linux machine. For instructions on downloading and building ICU4C, see the
     [ICU4C build instructions](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).

 2.  Follow the remaining instructions in
     [*$icu4c_root*/source/data/icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt).
     *$icu4c_root* is the root directory of ICU4C source package.