| --- |
| layout: default |
| title: StringPrep |
| nav_order: 7 |
| parent: Chars and Strings |
| --- |
| <!-- |
| © 2020 and later: Unicode, Inc. and others. |
| License & terms of use: http://www.unicode.org/copyright.html |
| --> |
| |
| # StringPrep |
| |
| ## Overview |
| |
| Comparing strings in a consistent manner becomes imperative when a large |
| repertoire of characters such as Unicode is used in network protocols. |
| StringPrep provides sets of rules for use of Unicode and syntax for prevention |
| of spoofing. The implementation of StringPrep and IDNA services and their usage |
| in ICU is described below. |
| |
| ## StringPrep |
| |
| StringPrep, the process of preparing Unicode strings for use in network |
| protocols is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt> ). |
| The RFC defines a broad framework and rules for processing the strings. |
| |
| Protocols that prescribe use of StringPrep must define a profile of StringPrep, |
| whose applicability is limited to the protocol. Profiles are a set of rules and |
| data tables which describe the how the strings should be prepare. The profiles |
| can choose to turn on or turn off normalization, checking for bidirectional |
| characters. They can also choose to add or remove mappings, unassigned and |
| prohibited code points from the tables provided. |
| |
| StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the |
| profiles. The profiles can chose to include or exclude tables or code points |
| from the tables defined by the RFC. |
| |
| StringPrep defines tables that can be broadly classified into |
| |
| 1. *Unassigned Table*: Contains code points that are unassigned in Unicode |
| Version 3.2. Unassigned code points may be allowed or disallowed in the |
| output string depending on the application. The table in Appendix A.1 of the |
| RFC contains the code points. |
| |
| 1. *Mapping Tables*: Code points that are commonly deleted from the output and |
| code points that are case mapped are included in this table. There are two |
| mapping tables in the Appendix namely B.1 and B.2 |
| |
| 2. *Prohibited Tables*: Contains code points that are prohibited from the |
| output string. Control codes, private use area code points, non-character |
| code points, surrogate code points, tagging and deprecated code points are |
| included in this table. There are nine mapping tables in Appendix which |
| include the prohibited code points namely C.1, C.2, C.3, C.4, C.5, C.6, C.7, |
| C.8 and C.9. |
| |
| The procedure for preparing strings for use can be described in the following |
| steps: |
| |
| 1. *Map*: For each code point in the input check if it has a mapping defined in |
| the mapping table, if so, replace it with the mapping in the output. |
| |
| 2. *Normalize*: Normalize the output of step 1 using Unicode Normalization Form |
| NFKC, it the option is set. Normalization algorithm must conform to UAX 15. |
| |
| 3. *Prohibit*: For each code point in the output of step 2 check if the code |
| point is present in the prohibited table, if so, fail returning an error. |
| |
| 4. *Check BiDi*: Check for code points with strong right-to-left directionality |
| in the output of step 3. If present, check if the string satisfies the rules |
| for bidirectional strings as specified. |
| |
| ## NamePrep |
| |
| NamePrep is a profile of StringPrep for use in IDNA. This profile in defined in |
| RFC 3491(<http://www.rfc-editor.org/rfc/rfc3491.txt> ). |
| |
| The profile specifies the following rules: |
| |
| 1. *Map* : Include all code point mappings specified in the StringPrep. |
| |
| 2. *Normalize*: Normalize the output of step 1 according to NFKC. |
| |
| 3. *Prohibit*: Prohibit all code points specified as prohibited in StringPrep |
| except for the space ( U+0020) code point from the output of step 2. |
| |
| 4. *Check BiDi*: Check for bidirectional code points and process according to |
| the rules specified in StringPrep. |
| |
| ## Punycode |
| |
| Punycode is an encoding scheme for Unicode for use in IDNA. Punycode converts |
| Unicode text to unique sequence of ASCII text and back to Unicode. It is an |
| ASCII Compatible Encoding (ACE). Punycode is described in RFC 3492 |
| (<http://www.rfc-editor.org/rfc/rfc3492.txt> ). |
| |
| The Punycode algorithm is a form of a general Bootstring algorithm which allows |
| strings composed of smaller set of code points to uniquely represent any string |
| of code points from a larger set. Punycode represents Unicode code points from |
| U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+0007F. The |
| algorithm can also preserve case information of the code points in the lager set |
| while and encoding and decoding. This feature, however, is not used in IDNA. |
| |
| ## Internationalizing Domain Names in Applications (IDNA) |
| |
| The Domain Name Service (DNS) protocol defines the procedure for matching of |
| ASCII strings case insensitively to the names in the lookup tables containing |
| mapping of IP (Internet Protocol) addresses to server names. When Unicode is |
| used instead of ASCII in server names then two problems arise which need to be |
| dealt with differently. When the server name is displayed to the user then |
| Unicode text should be displayed. When Unicode text is stored in lookup tables, |
| for compatibility with older DNS protocol and the resolver libraries, the text |
| should be the ASCII equivalent. The IDNA protocol, defined by RFC 3490 |
| (<http://www.rfc-editor.org/rfc/rfc3490.txt> ), satisfies the above |
| requirements. |
| |
| Server names stored in the DNS lookup tables are usually formed by concatenating |
| domain labels with a label separator, for example: |
| |
| The protocol defines operations to be performed on domain labels before the |
| names are stored in the lookup tables and before the names fetched from lookup |
| tables are displayed to the user. The operations are : |
| |
| 1. ToASCII: This operation is performed on domain labels before sending the |
| name to a resolver and before storing the name in the DNS lookup table. The |
| domain labels are processed by StringPrep algorithm by using the rules |
| specified by NamePrep profile. The output of this step is then encoded by |
| using Punycode and an ACE prefix is added to denote that the text is encoded |
| using Punycode. IDNA uses “xn--” before the encoded label. |
| |
| 1. ToUnicode: This operation is performed on domain labels before displaying |
| the names to to users. If the domain label is prefixed with the ACE prefix |
| for IDNA, then the label excluding the prefix is decoded using Punycode. The |
| output of Punycode decoder is verified by applying ToASCII operation and |
| comparing the output with the input to the ToUnicode operation. |
| |
| Unicode contains code points that are glyphically similar to the ASCII Full Stop |
| (U+002E). These code points must be treated as label separators when performing |
| ToASCII operation. These code points are : |
| |
| 1. Ideographic Full Stop (U+3002) |
| |
| 2. Full Width Full Stop (U+FF0E) |
| |
| 3. Half Width Ideographic Full Stop (U+FF61) |
| |
| Unassigned code points in Unicode Version 3.2 as given in StringPrep tables are |
| treated differently depending on how the processed string is used. For query |
| operations, where a registrar is requested for information regarding |
| availability of a certain domain name, unassigned code points are allowed to be |
| present in the string. For storing the string in DNS lookup tables, unassigned |
| code points are prohibited from the input. |
| |
| IDNA specifies that the ToUnicode and ToASCII have options to check for |
| Letter-Digit-Hyphen code points and adhere to the STD3 ASCII Rules. |
| |
| IDNA specifies that domain labels are equivalent if and only if the output of |
| ToASCII operation on the labels match using case insensitive ASCII comparison. |
| |
| ## StringPrep Service in ICU |
| |
| The StringPrep service in ICU is data driven. The service is based on |
| Open-Use-Close pattern. A StringPrep profile is opened, the strings are |
| processed according to the rules specified in the profile and the profile is |
| closed once the profile is ready to be disposed. |
| |
| Tools for filtering RFC 3454 and producing a rule file that can be compiled into |
| a binary format containing all the information required by the service are |
| provided. |
| |
| The procedure for producing a StringPrep profile data file are as given below: |
| |
| 1. Run filterRFC3454.pl Perl tool, to filter the RFC file and produce a rule |
| file. The text file produced can be edited by the clients to add/delete |
| mappings or add/delete prohibited code points. |
| |
| 2. Run the gensprep tool to compile the rule file into a binary format. The |
| options to turn on normalization of strings and checking of bidirectional |
| code points are passed as command line options to the tool. This tool |
| produces a binary profile file with the extension “spp”. |
| |
| 3. Open the StringPrep profile with path to the binary and name of the binary |
| profile file as the options to the open call. The profile data files are |
| memory mapped and cached for optimum performance. |
| |
| ### Code Snippets |
| |
| > :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should |
| keep the profile object around for reuse, instead of opening and closing the |
| profile each time.* |
| |
| #### C++ |
| |
| UErrorCode status = U_ZERO_ERROR; |
| UParseError parseError; |
| /* open the StringPrep profile */ |
| UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata", |
| "nfscsi", &status); |
| if(U_FAILURE(status)) { |
| /* handle the error */ |
| } |
| /* prepare the string for use according |
| * to the rules specified in the profile |
| */ |
| int32_t retLen = usprep_prepare(src, srcLength, dest, |
| destCapacity, USPREP_ALLOW_UNASSIGNED, |
| nameprep, &parseError, &status); |
| /* close the profile */ |
| usprep_close(nameprep); |
| |
| #### Java |
| |
| private static final StringPrep nfscsi = null; |
| //singleton instance |
| private static final NFSCSIStringPrep prep=new NFSCSIStringPrep(); |
| private NFSCSIStringPrep() { |
| try { |
| InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp"); |
| nfscsi = new StringPrep(nfscsiFile); |
| nfscsiFile.close(); |
| } catch(IOException e) { |
| throw new RuntimeException(e.toString()); |
| } |
| } |
| private static byte[] prepare(byte[] src, StringPrep prep) |
| throws StringPrepParseException, UnsupportedEncodingException { |
| String s = new String(src, "UTF-8"); |
| UCharacterIterator iter = UCharacterIterator.getInstance(s); |
| StringBuffer out = prep.prepare(iter,StringPrep.DEFAULT); |
| return out.toString().getBytes("UTF-8"); |
| } |
| |
| ## IDNA API in ICU |
| |
| ICU provides APIs for performing the ToASCII, ToUnicode and compare operations |
| as defined by the RFC 3490. Convenience methods for comparing IDNs are also |
| provided. These APIs follow ICU policies for string manipulation and coding |
| guidelines. |
| |
| ### Code Snippets |
| |
| > :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should |
| keep the profile object around for reuse, instead of opening and closing the |
| profile each time.* |
| |
| ### ToASCII operation |
| |
| ***C*** |
| |
| UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR); |
| destLen = uidna_toASCII(src, srcLen, dest, destCapacity, |
| UIDNA_DEFAULT, &parseError, &status); |
| if(status == U_BUFFER_OVERFLOW_ERROR) { |
| status = U_ZERO_ERROR; |
| destCapacity= destLen + 1; /* for the terminating Null */ |
| free(dest); /* free the memory */ |
| dest = (UChar*) malloc(destLen * U_SIZEOF_UCHAR); |
| destLen = uidna_toASCII(src, srcLen, dest, destCapacity, |
| UIDNA_DEFAULT, &parseError, &status); |
| } |
| if(U_FAILURE(status)) { |
| /* handle the error */ |
| } |
| /* do interesting stuff with output*/ |
| |
| ***Java*** |
| |
| try { |
| StringBuffer out= IDNA.convertToASCII(inBuf,IDNA.DEFAULT); |
| } catch(StringPrepParseException ex) { |
| /*handle the exception*/ |
| } |
| |
| ### toUnicode operation |
| |
| ***C*** |
| |
| UChar * dest = (UChar *) malloc(destCapacity * U_SIZEOF_UCHAR); |
| destLen = uidna_toUnicode(src, srcLen, dest, destCapacity, |
| UIDNA_DEFAULT |
| &parseError, &status); |
| if(status == U_BUFFER_OVERFLOW_ERROR) { |
| status = U_ZERO_ERROR; |
| destCapacity= destLen + 1; /* for the terminating Null */ |
| /* free the memory */ |
| free(dest); |
| dest = (UChar*) malloc(destLen * U_SIZEOF_UCHAR); |
| destLen = uidna_toUnicode(src, srcLen, dest, destCapacity, |
| UIDNA_DEFAULT, &parseError, &status); |
| } |
| if(U_FAILURE(status)) { |
| /* handle the error */ |
| } |
| /* do interesting stuff with output*/ |
| |
| ***Java*** |
| |
| try { |
| StringBuffer out= IDNA.convertToUnicode(inBuf,IDNA.DEFAULT); |
| } catch(StringPrepParseException ex) { |
| // handle the exception |
| } |
| |
| ### compare operation |
| |
| ***C*** |
| |
| int32_t rc = uidna_compare(source1, length1, |
| source2, length2, |
| UIDNA_DEFAULT, |
| &status); |
| if(rc==0) { |
| /* the IDNs are same ... do something interesting */ |
| } else { |
| /* the IDNs are different ... do something */ |
| } |
| |
| ***Java*** |
| |
| try { |
| int retVal = IDNA.compare(s1,s2,IDNA.DEFAULT); |
| // do something interesting with retVal |
| } catch(StringPrepParseException e) { |
| // handle the exception |
| } |
| |
| ## Design Considerations |
| |
| StringPrep profiles exhibit the following characteristics: |
| |
| 1. The profiles contain information about code points. StringPrep allows |
| profiles to add/delete code points or mappings. |
| |
| 2. Options such as turning normalization and checking for bidirectional code |
| points on or off are the properties of the profiles. |
| |
| 3. The StringPrep algorithm is not overridden by the profile. |
| |
| 4. Once defined, the profiles do not change. |
| |
| The StringPrep profiles are used in network protocols so runtime performance is |
| important. |
| |
| Many profiles have been and are being defined, so applications should be able to |
| plug-in arbitrary profiles and get the desired result out of the framework. |
| |
| ICU is designed for this usage by providing build-time tools for arbitrary |
| StringPrep profile definitions, and loading them from application-supplied data |
| in binary form with data structures optimized for runtime use. |
| |
| ## Demo |
| |
| A web application at <https://icu4c-demos.unicode.org/icu-bin/idnbrowser> |
| illustrates the use of IDNA API. The source code for the application is |
| available at <https://github.com/unicode-org/icu-demos/tree/main/idnbrowser>. |
| |
| ## Appendix |
| |
| #### NFS Version 4 Profiles |
| |
| Network File System Version 4 defined by RFC 3530 |
| (<http://www.rfc-editor.org/rfc/rfc3530.txt> ) defines use of Unicode text in |
| the protocol. ICU provides the requisite profiles as part of test suite and code |
| for processing the strings according the profiles as a part of samples. |
| |
| The RFC defines three profiles : |
| |
| 1. *nfs4_cs_prep Profile*: This profile is used for preparing file and path |
| name strings. Normalization of code points and checking for bidirectional |
| code points are turned off. Case mappings are included if the NFS |
| implementation supports case insensitive file and path names. |
| |
| 2. *nfs4_cis_prep Profile*: This profile is used for preparing NFS server |
| names. Normalization of code points and checking for bidirectional code |
| points are turned on. This profile is equivalent to NamePrep profile. |
| |
| 3. *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the |
| Access Control Entries of NFS servers. These strings consist of two parts, |
| prefix and suffix, separated by '@' (U+0040). The prefix is processed with |
| case mappings turned off and the suffix is processed with case mappings |
| turned on. Normalization of code points and checking for bidirectional code |
| points are turned on. |
| |
| #### XMPP Profiles |
| |
| Extensible Messaging and Presence Protocol (XMPP) is an XML based protocol for |
| near real-time extensible messaging and presence. This protocol defines use of |
| two StringPrep profiles: |
| |
| 1. *ResourcePrep Profile*: This profile is used for processing the resource |
| identifiers within XMPP. Normalization of code points and checking of |
| bidirectional code points are turned on. Case mappings are excluded. The |
| space code point (U+0020) is excluded from the prohibited code points set. |
| |
| 2. *NodePrep Profile*: This profile is used for processing the node identifiers |
| within XMPP. Normalization of code points and checking of bidirectional code |
| points are turned on. Case mappings are included. All code points specified |
| as prohibited in StringPrep are prohibited. Additional code points are added |
| to the prohibited set. |