
 <!--
 © 2019 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 ICU Data Build Tool
 ===================

 ICU 64 provides a tool for configuring your ICU locale data file with finer
 granularity.  This page explains how to use this tool to customize and reduce
 your data file size.

 ## Overview: What is in the ICU data file?

 There are hundreds of **locales** supported in ICU (including script and
 region variants), and ICU supports many different **features**.  For each
 locale and for each feature, data is stored in one or more data files.

 Those data files are compiled and then bundled into a `.dat` file called
 something like `icudt64l.dat`, which is little-endian data for ICU 64. This
 dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
 on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

 At a high level, the size of the ICU data file corresponds to the
 cross-product of locales and features, except that not all features require
 locale-specific data, and not all locales require data for all features. The
 data file contents can be approximately visualized like this:

 <img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

 The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped.  This file
 size is too large for certain use cases, such as bundling the data file into a
 smartphone app or an embedded device.  The ICU Data Build Tool aims to solve
 this problem.

 ## ICU Data Configuration File

 The ICU Data Build Tool enables you to write a configuration file that
 specifies what features and locales to include in a custom data bundle.

 The configuration file may be written in either [JSON](http://json.org/) or
 [Hjson](https://hjson.org/).  To build ICU4C with custom data, set the
 `ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
 Unix or when building the data package on Windows.  For example:

     ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

 You must have the data sources in order to use the ICU Data Build Tool.
 Check for the file `icu4c/source/data/locales/root.txt`. If that file is
 missing, you need to download `icu4c-*-data.zip` and replace the contents of
 `icu4c/source/data` with the data directory from the zip file.

 In order to use Hjson syntax, the `hjson` pip module must be installed on
 your system.  You should also consider installing the `jsonschema` module to
 print messages when errors are found in your config file.

     $ pip3 install --user hjson jsonschema

 To build ICU4J with custom data, you must first build ICU4C with custom data
 and then generate the JAR file.  For more information, read
 [icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt).

 ### Locale Slicing

 The simplest way to slice ICU data is by locale.  The ICU Data Build Tool
 makes it easy to select your desired locales to suit a number of use cases.

 #### Filtering by Language Only

 Here is a *filters.json* file that builds ICU data with support for English,
 Chinese, and German, including *all* script and regional variants for those
 languages:

     {
       "localeFilter": {
         "filterType": "language",
         "whitelist": [
           "en",
           "de",
           "zh"
         ]
       }
     }

 The *filterType* "language" only supports slicing by entire languages.
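
As a rough mental model only (this is not the tool's actual code), the "language" filter can be pictured as keeping every locale whose language subtag is in the whitelist, plus `root`. The helper names and the candidate locale list below are made up for illustration.

```python
def language_of(locale_id):
    """Return the language subtag of an ICU locale ID such as 'zh_Hant_TW'."""
    return locale_id.split("_", 1)[0]


def matches_language_filter(locale_id, whitelist):
    """Keep 'root' and every locale whose language is whitelisted."""
    return locale_id == "root" or language_of(locale_id) in whitelist


# Hypothetical inputs: the whitelist from the example above and a few locale IDs.
whitelist = {"en", "de", "zh"}
candidates = ["root", "en", "en_AU", "de_CH", "zh_Hant_TW", "fr", "fr_CA"]

kept = [loc for loc in candidates if matches_language_filter(loc, whitelist)]
print(kept)
```

Under this model, script and regional variants such as `zh_Hant_TW` are kept because only the language subtag is compared.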

 #### Filtering by Locale

 For more control, use *filterType* "locale".  Here is a *filters.hjson* file that
 includes the same three languages as above, including regional variants, but
 only the default script (e.g., Simplified Han for Chinese):

     localeFilter: {
       filterType: locale
       whitelist: [
         en
         de
         zh
       ]
     }

 #### Adding Script Variants (includeScripts = true)

 You may set the *includeScripts* option to true to include all scripts for a
 language while using *filterType* "locale".  This results in behavior similar
 to *filterType* "language".  In the following JSON example, all scripts for
 Chinese are included:

     {
       "localeFilter": {
         "filterType": "locale",
         "includeScripts": true,
         "whitelist": [
           "en",
           "de",
           "zh"
         ]
       }
     }

 If you wish to explicitly list the scripts, you may put the script code in the
 locale tag in the whitelist, and you do not need the *includeScripts* option
 enabled.  For example, in Hjson, to include Han Traditional ***but not Han
 Simplified***:

     localeFilter: {
       filterType: locale
       whitelist: [
         en
         de
         zh_Hant
       ]
     }

 Note: the option *includeScripts* is only supported at the language level;
 i.e., in order to include all scripts for a particular language, you must
 specify the language alone, without a region tag.

 #### Removing Regional Variants (includeChildren = false)

 If you wish to enumerate exactly which regional variants to support, you may
 use *filterType* "locale" with the *includeChildren* setting set to false.
 The following *filters.hjson* file includes English (US), English
 (UK), German (Germany), and Chinese (China, Han Simplified), as well as their
 dependencies, *but not* other regional variants like English (Australia),
 German (Switzerland), or Chinese (Taiwan, Han Traditional):

     localeFilter: {
       filterType: locale
       includeChildren: false
       whitelist: [
         en_US
         en_GB
         de_DE
         zh_CN
       ]
     }

 Including dependencies, the above filter would include the following data files:

 - root.txt
 - en.txt
 - en_US.txt
 - en_001.txt
 - en_GB.txt
 - de.txt
 - de_DE.txt
 - zh.txt
 - zh_Hans.txt
 - zh_Hans_CN.txt
 - zh_CN.txt
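
The list above can be reproduced with a small sketch that walks each whitelisted locale up to `root`. The alias and parent tables here are hand-written assumptions covering only this example (the real tool derives such relationships from ICU's own data), and the function names are hypothetical.

```python
# Hand-written for this example only; not read from ICU data.
ALIASES = {"zh_CN": "zh_Hans_CN"}        # zh_CN is treated as an alias file
EXPLICIT_PARENTS = {"en_GB": "en_001"}   # parent that is not found by truncation


def parent_of(locale_id):
    """Return the parent locale, or None once 'root' has been reached."""
    if locale_id == "root":
        return None
    if locale_id in EXPLICIT_PARENTS:
        return EXPLICIT_PARENTS[locale_id]
    if "_" in locale_id:
        return locale_id.rsplit("_", 1)[0]   # drop the last subtag
    return "root"


def required_locales(whitelist):
    """Collect each whitelisted locale, its alias target, and all ancestors."""
    needed = set()
    for locale_id in whitelist:
        needed.add(locale_id)
        current = ALIASES.get(locale_id, locale_id)
        while current is not None:
            needed.add(current)
            current = parent_of(current)
    return needed


files = sorted(f"{loc}.txt" for loc in required_locales(["en_US", "en_GB", "de_DE", "zh_CN"]))
print(files)
```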

 ### File Slicing (coarse-grained features)

 ICU provides a lot of features, of which you probably need only a small subset
 for your application.  Feature slicing is a powerful way to prune out data for
 any features you are not using.

 ***CAUTION:*** When slicing by features, you must manually include all
 dependencies.  For example, if you are formatting dates, you must include not
 only the date formatting data but also the number formatting data, since dates
 contain numbers.  Expect to spend a fair bit of time debugging your feature
 filter to get it to work the way you expect it to.

 The data for many ICU features live in individual files.  The ICU Data Build
 Tool puts similar *types* of files into categories.  The following table
 summarizes the ICU data files and their corresponding features and categories:

 | Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/master/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
 |---|---|---|---|
 | Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
 | Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
 | Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
 | Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
 | Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
 | Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
 | Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
 | Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
 | Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
 | Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
 | Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
 | StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
 | Time Zones | `"misc"` <br/> `"zone_tree"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** |
 | Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
 | Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
 | Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
 | Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
 | **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

 #### Additive and Subtractive Modes

 The ICU Data Build Tool allows two strategies for selecting features:
 *additive* mode and *subtractive* mode.

 The default is to use subtractive mode. This means that all ICU data is
 included, and your configurations can remove or change data from that baseline.
 Additive mode means that you start with an *empty* ICU data file, and you must
 explicitly add the data required for your application.

 There are two concrete differences between additive and subtractive mode:

 |                         | Additive    | Subtractive |
 |-------------------------|-------------|-------------|
 | Default Feature Filter  | `"exclude"` | `"include"` |
 | Default Resource Filter | `"-/"`      | `"+/"`      |

 To enable additive mode, add the following setting to your filter file:

     strategy: "additive"
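
The effect of the strategy setting on defaults can be sketched as a lookup. This is an illustration of the table above, not the tool's code; the dictionary layout and the function name are assumptions.

```python
# Defaults taken from the table above.
DEFAULTS = {
    "subtractive": {"feature_filter": "include", "resource_filter": "+/"},
    "additive": {"feature_filter": "exclude", "resource_filter": "-/"},
}


def effective_feature_filter(config, category):
    """Return the filter for a category, falling back to the strategy default."""
    strategy = config.get("strategy", "subtractive")  # subtractive is the default
    explicit = config.get("featureFilters", {})
    return explicit.get(category, DEFAULTS[strategy]["feature_filter"])


additive_config = {"strategy": "additive", "featureFilters": {"locales_tree": "include"}}
print(effective_feature_filter(additive_config, "locales_tree"))  # explicitly listed
print(effective_feature_filter(additive_config, "translit"))      # falls back to default
print(effective_feature_filter({}, "translit"))                   # no strategy given
```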

 #### Filter Types

 You may list *filters* for each category in the *featureFilters* section of
 your config file.  What follows are examples of the possible types of filters.

 ##### Inclusion Filter

 To include a category, use the string `"include"` as your filter.

     featureFilters: {
       locales_tree: include
     }

 If the category is a locale tree (ends with `_tree`), the inclusion filter
 resolves to the `localeFilter`; for more information, see the section
 "Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
 the category to be included.

 **NOTE:** When subtractive mode is used (default), all categories implicitly
 start with `"include"` as their filter.

 ##### Exclusion Filter

 To exclude an entire category, use *filterType* "exclude".  For example, to
 exclude all confusables data:

     featureFilters: {
       confusables: {
         filterType: exclude
       }
     }

 Since ICU 65, you can also write simply:

     featureFilters: {
       confusables: exclude
     }

 **NOTE:** When additive mode is used, all categories implicitly start with
 `"exclude"` as their filter.

 ##### File Name Filter

 To select individual files within a category, use the file name filter, which
 is the default type of filter when *filterType* is not specified.  For example,
 to include the Burmese break iteration dictionary but not any other
 dictionaries:

     featureFilters: {
       brkitr_dictionaries: {
         whitelist: [
           burmesedict
         ]
       }
     }

 Do *not* include directories or file extensions.  They will be added
 automatically for you.  Note that all files in a particular category have the
 same directory and extension.

 You can use either a whitelist or a blacklist for the file name filter.
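
A minimal sketch of the idea, assuming the filter compares bare file stems (no directory, no extension) and that exactly one of the two lists is given. The function name and the sample stems are illustrative, not taken from the tool.

```python
def file_name_filter(stems, whitelist=None, blacklist=None):
    """Keep stems named in the whitelist, or drop stems named in the blacklist."""
    # Assumption for this sketch: exactly one of the two lists is provided.
    if (whitelist is None) == (blacklist is None):
        raise ValueError("provide exactly one of whitelist or blacklist")
    if whitelist is not None:
        return [s for s in stems if s in whitelist]
    return [s for s in stems if s not in blacklist]


# Sample stems for the brkitr_dictionaries category (illustrative).
stems = ["burmesedict", "cjdict", "khmerdict", "laodict", "thaidict"]

print(file_name_filter(stems, whitelist=["burmesedict"]))
print(file_name_filter(stems, blacklist=["cjdict"]))
```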

 ##### Regex Filter

 To exclude filenames matching a certain regular expression, use *filterType*
 "regex".  For example, to reject the CJK-specific break iteration rules:

     featureFilters: {
       brkitr_rules: {
         filterType: regex
         blacklist: [
           ^.*_cj$
         ]
       }
     }

 The Python standard library [*re*
 module](https://docs.python.org/3/library/re.html) is used for evaluating the
 regular expressions.  In case the regular expression engine is changed in the
 future, however, you are encouraged to restrict yourself to a simple set of
 regular expression operators.

 As above, do not include directories or file extensions, and you can use
 either a whitelist or a blacklist.
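
To check a pattern before putting it in your filter file, you can try it against a few file stems with Python's *re* module. This sketch assumes the pattern is applied to the bare stem; the stems are examples, and whether the tool uses `re.search` or `re.match` internally is not specified here (for a pattern anchored with `^` and `$`, both give the same answer).

```python
import re


def regex_blacklist(stems, patterns):
    """Drop every stem that matches at least one of the given patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [s for s in stems if not any(rx.search(s) for rx in compiled)]


# Example stems in the style of the brkitr_rules category.
stems = ["char", "line", "line_cj", "line_loose_cj", "word"]

remaining = regex_blacklist(stems, [r"^.*_cj$"])
print(remaining)
```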

 ##### Union Filter

 You can combine the results of multiple filters with *filterType* "union".
 This filter matches files that match *at least one* of the provided filters.
 The syntax is:

     {
       filterType: union
       unionOf: [
         { /* filter 1 */ },
         { /* filter 2 */ },
         // ...
       ]
     }

 This filter type is useful for combining "locale" filters with different
 includeScripts or includeChildren options.
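
Conceptually, a union behaves like a logical OR over its member filters. The sketch below models each filter as a plain Python predicate; the two member filters are simplified stand-ins for "locale" filters with different options, they ignore dependency handling, and all names are hypothetical.

```python
def union_of(*filters):
    """Return a filter that accepts a name if any member filter accepts it."""
    def combined(name):
        return any(f(name) for f in filters)
    return combined


# Stand-in for a locale filter that lists exact regional variants.
def exact_regions(locale_id):
    return locale_id in {"en_US", "en_GB"}


# Stand-in for a locale filter that accepts every script and region of Chinese.
def all_chinese(locale_id):
    return locale_id.split("_", 1)[0] == "zh"


combined = union_of(exact_regions, all_chinese)
candidates = ["en_US", "en_AU", "zh_Hant_TW", "de"]
print([loc for loc in candidates if combined(loc)])
```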

 #### Locale-Tree Categories

 Several categories have the `_tree` suffix.  These categories are for "locale
 trees": they contain locale-specific data.  ***The [localeFilter configuration
 option](#locale-slicing) sets the default file filter for all `_tree`
 categories.***

 If you want to include different locales for different locale file trees, you
 can override their filter in the *featureFilters* section of the config file.
 For example, to include only Italian data for currency symbols *instead of*
 the common locales specified in *localeFilter*, you can do the following:

     featureFilters: {
       curr_tree: {
         filterType: locale
         whitelist: [
           it
         ]
       }
     }

 You can exclude an entire `_tree` category without affecting other categories.
 For example, to exclude region display names:

     featureFilters: {
       region_tree: {
         filterType: exclude
       }
     }

 Note that you are able to use any of the other filter types for `_tree`
 categories, but you must be very careful that you are including all of the
 correct files.  For example, `en_GB` requires `en_001`, and you must always
 include `root`.  If you use the "language" or "locale" filter types, this
 logic is done for you.

 ### Resource Bundle Slicing (fine-grained features)

 The third section of the ICU filter config file is *resourceFilters*.  With
 this section, you can dive inside resource bundle files to remove even more
 data.

 You can apply resource filters to all locale tree categories as well as to
 categories that include resource bundles, such as the `"misc"` category.

 For example, consider measurement units.  There is one unit file per locale (example:
 [en.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unit/en.txt)),
 and that file contains data for all measurement units in CLDR.  However, if
 you are only formatting distances, for example, you may need the data for only
 a small set of units.

 Here is how you could include units of length in the "short" style but no
 other units:

     resourceFilters: [
       {
         categories: [
           unit_tree
         ]
         rules: [
           -/units
           -/unitsNarrow
           -/unitsShort
           +/unitsShort/length
         ]
       }
     ]

 Conceptually, the rules are applied from top to bottom.  First, all data for
 all three styles of units are removed, and then the short length units are
 added back.

 **NOTE:** In subtractive mode, resource paths are *included* by default. In
 additive mode, resource paths are *excluded* by default.

 #### Wildcard Character

 You can use the wildcard character (`*`) to match a piece of the resource
 path.  For example, to include length units for all three styles, you can do:

     resourceFilters: [
       {
         categories: [
           unit_tree
         ]
         rules: [
           -/units
           -/unitsNarrow
           -/unitsShort
           +/*/length
         ]
       }
     ]

 The wildcard must be the only character in its path segment. Future ICU
 versions may expand the syntax.
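
One simplified way to reason about a rule list is "the last matching rule wins", where a rule matches a resource path if it is a prefix of that path and `*` stands for exactly one path segment. The sketch below implements only that mental model for leaf paths; it is not how genrb is implemented, and the function names and sample resource paths are assumptions.

```python
def rule_matches(rule_path, resource_path):
    """True if rule_path is a segment-wise prefix of resource_path."""
    rule_segments = [s for s in rule_path.split("/") if s]
    path_segments = [s for s in resource_path.split("/") if s]
    if len(rule_segments) > len(path_segments):
        return False
    # '*' matches any single segment; other segments must be equal.
    return all(r == "*" or r == p for r, p in zip(rule_segments, path_segments))


def is_included(resource_path, rules, default=True):
    """Apply rules top to bottom; the last rule that matches decides."""
    included = default  # True models subtractive mode, False additive mode
    for rule in rules:
        sign, rule_path = rule[0], rule[1:]
        if rule_matches(rule_path, resource_path):
            included = (sign == "+")
    return included


rules = ["-/units", "-/unitsNarrow", "-/unitsShort", "+/*/length"]
for path in ["/units/length/meter", "/unitsShort/length/meter",
             "/unitsNarrow/mass/gram", "/durationUnits/hm"]:
    print(path, is_included(path, rules))
```

A path that no rule mentions keeps the default for the current strategy.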

 #### Resource Filter for Specific File

 The resource filter object takes an optional *files* setting which accepts a
 file filter in the same syntax used above for file filtering.  For example, if
 you wanted to apply a filter to misc/supplementalData.txt, you could do the
 following (this example removes calendar data):

     resourceFilters: [
       {
         categories: ["misc"]
         files: {
           whitelist: ["supplementalData"]
         }
         rules: [
           -/calendarData
         ]
       }
     ]

 #### Combining Multiple Resource Filter Specs

 You can also list multiple resource filter objects in the *resourceFilters*
 array; the filters are added from top to bottom.  For example, here is an
 advanced configuration that includes "mile" for en-US and "kilometer" for
 en-CA; this also makes use of the *files* option:

     resourceFilters: [
       {
         categories: ["unit_tree"]
         rules: [
           -/units
           -/unitsNarrow
           -/unitsShort
         ]
       },
       {
         categories: ["unit_tree"]
         files: {
           filterType: locale
           whitelist: ["en_US"]
         }
         rules: [
           +/*/length/mile
         ]
       },
       {
         categories: ["unit_tree"]
         files: {
           filterType: locale
           whitelist: ["en_CA"]
         }
         rules: [
           +/*/length/kilometer
         ]
       }
     ]

 The above example would give en-US these resource filter rules:

     -/units
     -/unitsNarrow
     -/unitsShort
     +/*/length/mile

 and en-CA these resource filter rules:

     -/units
     -/unitsNarrow
     -/unitsShort
     +/*/length/kilometer

 In accordance with *filterType* "locale", the parent locales *en* and *root*
 would get both units; this is required since both en-US and en-CA may inherit
 from the parent locale:

     -/units
     -/unitsNarrow
     -/unitsShort
     +/*/length/mile
     +/*/length/kilometer
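
The per-locale rule lists above can be reproduced with a short sketch. It assumes that a spec without a *files* setting applies to every file, and that a spec with a locale whitelist applies to each whitelisted locale and to its ancestors found by dropping subtags. The `locales` key is a hypothetical shorthand for the *files* setting, and child-locale handling is left out.

```python
def ancestors(locale_id):
    """Return the locale followed by its parents by truncation, ending at 'root'."""
    chain = [locale_id]
    while "_" in chain[-1]:
        chain.append(chain[-1].rsplit("_", 1)[0])
    if chain[-1] != "root":
        chain.append("root")
    return chain


def rules_for(locale_id, specs):
    """Concatenate, top to bottom, the rules of every spec that applies."""
    rules = []
    for spec in specs:
        whitelist = spec.get("locales")  # None means the spec applies to all files
        if whitelist is None or any(locale_id in ancestors(w) for w in whitelist):
            rules.extend(spec["rules"])
    return rules


base = ["-/units", "-/unitsNarrow", "-/unitsShort"]
specs = [
    {"rules": base},
    {"locales": ["en_US"], "rules": ["+/*/length/mile"]},
    {"locales": ["en_CA"], "rules": ["+/*/length/kilometer"]},
]

for locale_id in ["en_US", "en_CA", "en", "root", "de"]:
    print(locale_id, rules_for(locale_id, specs))
```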

 ## Debugging Tips

 **Run Python directly:** If you do not want to wait for ./runConfigureICU to
 finish, you can directly re-generate the rules using your filter file with the
 following command line run from *icu4c/source*:

     $ PYTHONPATH=python python3 -m icutools.databuilder \
       --mode=gnumake --src_dir=data > data/rules.mk

 **Install jsonschema:** Install the `jsonschema` pip package to get warnings
 about problems with your filter file.

 **See what data is being used:** ICU is instrumented to allow you to trace
 which resources are used at runtime. This can help you determine what data you
 need to include. For more information, see [tracing.md](tracing.md).

 **Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
 inside *icu4c/source/data*. To see what is going to get built, you can inspect
 that file. First build ICU normally, and copy *rules.mk* to
 *rules_default.mk*. Then build ICU with your filter file. Now you can take the
 diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
 file is removing.
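
If a command-line `diff` is not at hand, the comparison can also be done with Python's standard `difflib` module. The sketch below is one possible approach; the two sample strings are placeholders and do not show the real contents of *rules.mk*.

```python
import difflib


def diff_rules(default_text, filtered_text):
    """Return unified-diff lines between the default and the filtered rules."""
    return list(difflib.unified_diff(
        default_text.splitlines(),
        filtered_text.splitlines(),
        fromfile="rules_default.mk",
        tofile="rules.mk",
        lineterm="",
    ))


# Placeholder contents; in practice, read both files from icu4c/source/data,
# for example with pathlib.Path("rules.mk").read_text().
default_text = "unit/en.res\nunit/fr.res\nunit/de.res\n"
filtered_text = "unit/en.res\nunit/de.res\n"

diff_lines = diff_rules(default_text, filtered_text)
print("\n".join(diff_lines))
```

Lines prefixed with `-` exist only in the default build, so they show what the filter file removed.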

 **Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
 you can look inside the directory *icu4c/source/data/out* to see the files
 that got built.

 **Inspect the compiled resource filter rules:** If you are using a resource
 filter, the resource filter rules get compiled for each individual locale
 inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
 what filter rules are being applied to each individual locale.

 **Run genrb in verbose mode:** For debugging a resource filter, you can run
 genrb in verbose mode to see which resources got stripped. To do this, first
 inspect the make output and find a command line like this:

     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

 Copy that command line and re-run it from *icu4c/source/data* with the `-v`
 flag added to the end. The command will print out exactly which resource paths
 are being included and excluded as well as a model of the filter rules applied
 to this file.

 **Inspect .res files with derb:** The `derb` tool can convert .res files back
 to .txt files after filtering. For example, to convert the above unit res file
 back to a txt file, you can run this command from *icu4c/source*:

     LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

 That will produce a file *en.txt* in your current directory, which is the
 original *data/unit/en.txt* but after resource filters were applied.

 *Tip:* derb expects your res files to be rooted in a directory named
 `icudt64l` (corresponding to your current ICU version and endianness). If your
 files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

 **Put complex rules first** and **use the wildcard `*` sparingly:** The order
 of the filter rules matters a great deal in how effective your data size
 reduction can be, and the wildcard `*` can sometimes produce behavior that is
 tricky to reason about. For example, these three lists of filter rules look
 similar at first glance but actually produce different output:

 <table>
 <tr>
 <th>Unit Resource Filter Rules</th>
 <th>Unit Resource Size</th>
 <th>Commentary</th>
 <th>Result</th>
 </tr>
 <tr><td><pre>
 -/*/*
 +/*/digital
 -/*/digital/*/dnam
 -/durationUnits
 -/units
 -/unitsNarrow
 </pre></td><td>77 KiB</td><td>
 First, remove all unit types. Then, add back digital units across all unit
 widths. Then, remove display names from digital units. Then, remove duration
 unit patterns and long and narrow forms.
 </td><td>
 Digital units in short form are included; all other units are removed.
 </td></tr>
 <tr><td><pre>
 -/durationUnits
 -/units
 -/unitsNarrow
 -/*/*
 +/*/digital
 -/*/digital/*/dnam
 </pre></td><td>125 KiB</td><td>
 First, remove duration unit patterns and long and narrow forms. Then, remove
 all unit types. Then, add back digital units across all unit widths. Then,
 remove display names from digital units.
 </td><td>
 Digital units are included <em>in all widths</em>; all other units are removed.
 </td></tr>
 <tr><td><pre>
 -/*/*
 +/*/digital
 -/*/*/*/dnam
 -/durationUnits
 -/units
 -/unitsNarrow
 </pre></td><td>191 KiB</td><td>
 First, remove all unit types. Then, add back digital units across all unit
 widths. Then, remove display names from all units. Then, remove duration unit
 patterns and long and narrow forms.
 </td><td>
 Digital units in short form are included, as is the <em>tree structure</em>
 for all other units, even though the other units have no real data.
 </td></tr>
 </table>

 By design, empty tree structure is retained in the unit bundle. This is
 because there are numerous instances in ICU data where the presence of an
 empty tree carries meaning. However, it means that you must be careful when
 building resource filter rules in order to achieve the optimal data bundle
 size.

 Using the `-v` option in genrb (described above) is helpful when debugging
 these types of issues.

 ## Other Features of the ICU Data Build Tool

 While data filtering is the primary reason the ICU Data Build Tool was
 developed, there are additional use cases.

 ### Running Data Build without Configure/Make

 You can build the dat file outside of the ICU build system by directly
 invoking the Python icutools.databuilder.  Run the following command to see the
 help text for the CLI tool:

     $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

 ### Collation UCAData

 For using collation (sorting and searching) in any language, the "root"
 collation data file must be included. It provides the Unicode CLDR default
 sort order for all code points, and forms the basis for language-specific
 tailorings as well as for custom collators built at runtime.

 There are two versions of the root collation data file:

 - ucadata-unihan.txt (compiled size: 511 KiB)
 - ucadata-implicithan.txt (compiled size: 178 KiB)

 The unihan version sorts Han characters in radical-stroke order according to
 Unicode, which is a somewhat useful default sort order, especially for use
 with non-CJK languages.  The implicithan version sorts Han characters in the
 order of their Unicode assignment, which is similar to radical-stroke order
 for common characters but arbitrary for others.  For more information, see
 [UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

 By default, the unihan version is used.  The unihan version of the data file
 is much larger than that for implicithan, so if you need collation but also
 small data, then you may want to select the implicithan version.  To use the
 implicithan version, put the following setting in your *filters.json* file:

     {
       "collationUCAData": "implicithan"
     }

 ### Disable Pool Bundle

 By default, ICU uses a "pool bundle" to store strings shared between locales.
 This saves space and is recommended for most users. However, when developing
 a system where locale data files may be added "on the fly" and not included in
 the original ICU distribution, those additional data files may not be able to
 use a pool bundle due to name collisions with the existing pool bundle.

 To disable the pool bundle in the current ICU build, put the following setting
 in your *filters.json* file:

     {
       "usePoolBundle": false
     }

 ### File Substitution

 Using the configuration file, you can perform whole-file substitutions.  For
 example, suppose you want to replace the transliteration rules for
 *Zawgyi_my*.  You could create a directory called `my_icu_substitutions`
 containing your new `Zawgyi_my.txt` rule file, and then put this in your
 configuration file:

     fileReplacements: {
       directory: "/path/to/my_icu_substitutions"
       replacements: [
         {
           src: "Zawgyi_my.txt"
           dest: "translit/Zawgyi_my.txt"
         },
         "misc/dayPeriods.txt"
       ]
     }

 `directory` should either be an absolute path, or a path starting with one of
 the following, and it should not contain a trailing slash:

 - "$SRC" for the *icu4c/source/data* directory in the source tree
 - "$FILTERS" for the directory containing filters.json
 - "$CWD" for your current working directory

 When the entry in the `replacements` array is an object, the `src` and `dest`
 fields indicate, for each file in the source directory (`src`), what file in
 the ICU hierarchy it should replace (`dest`). When the entry is a string, the
 same relative path is used for both `src` and `dest`.
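
The two entry forms can be illustrated with a sketch that expands a `fileReplacements` object into (source file, destination) pairs. The function name is hypothetical, and the sketch does not expand the `$SRC`, `$FILTERS`, or `$CWD` prefixes.

```python
def resolve_replacements(config):
    """Expand fileReplacements entries into (source file, destination) pairs."""
    directory = config["directory"]  # expected without a trailing slash
    pairs = []
    for entry in config["replacements"]:
        if isinstance(entry, str):
            # A bare string uses the same relative path for src and dest.
            src, dest = entry, entry
        else:
            src, dest = entry["src"], entry["dest"]
        pairs.append((f"{directory}/{src}", dest))
    return pairs


file_replacements = {
    "directory": "/path/to/my_icu_substitutions",
    "replacements": [
        {"src": "Zawgyi_my.txt", "dest": "translit/Zawgyi_my.txt"},
        "misc/dayPeriods.txt",
    ],
}

for source, dest in resolve_replacements(file_replacements):
    print(source, "->", dest)
```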

 Whole-file substitution happens before all other filters are applied.