---
layout: default
title: StringPrep
nav_order: 7
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# StringPrep
## Overview
Comparing strings in a consistent manner becomes imperative when a large
repertoire of characters such as Unicode is used in network protocols.
StringPrep provides sets of rules for use of Unicode and syntax for prevention
of spoofing. The implementation of StringPrep and IDNA services and their usage
in ICU is described below.
## StringPrep
StringPrep, the process of preparing Unicode strings for use in network
protocols, is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt>).
The RFC defines a broad framework and rules for processing the strings.
Protocols that prescribe use of StringPrep must define a profile of StringPrep,
whose applicability is limited to the protocol. Profiles are a set of rules and
data tables which describe how the strings should be prepared. The profiles
can choose to turn normalization and checking for bidirectional characters on
or off. They can also choose to add or remove mappings, unassigned and
prohibited code points from the tables provided.
StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the
profiles. The profiles can choose to include or exclude tables or code points
from the tables defined by the RFC.
StringPrep defines tables that can be broadly classified into:
1. *Unassigned Table*: Contains code points that are unassigned in Unicode
Version 3.2. Unassigned code points may be allowed or disallowed in the
output string depending on the application. The table in Appendix A.1 of the
RFC contains the code points.
2. *Mapping Tables*: Code points that are commonly deleted from the output and
code points that are case mapped are included in these tables. There are three
mapping tables in Appendix B of the RFC: B.1, B.2 and B.3.
3. *Prohibited Tables*: Contain code points that are prohibited from the
output string. Control codes, private use area code points, non-character
code points, surrogate code points, tagging and deprecated code points are
included in these tables. There are nine tables in Appendix C of the RFC which
list the prohibited code points: C.1, C.2, C.3, C.4, C.5, C.6, C.7,
C.8 and C.9.
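
Several of the prohibited categories listed above correspond roughly to Unicode general categories. The following Java sketch illustrates that correspondence using only `java.lang.Character`. It is an approximation under stated assumptions: it does not read the RFC 3454 tables, the JDK follows a newer Unicode version than 3.2, and the class and method names are hypothetical.

```java
final class TableCategorySketch {
    /** Rough stand-in for a table lookup; NOT the real RFC 3454 table data. */
    static String classify(int cp) {
        // Non-characters: U+FDD0..U+FDEF and the last two code points of each plane.
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) {
            return "prohibited: non-character";
        }
        switch (Character.getType(cp)) {
            case Character.UNASSIGNED:  return "unassigned";
            case Character.CONTROL:     return "prohibited: control";
            case Character.PRIVATE_USE: return "prohibited: private use";
            case Character.SURROGATE:   return "prohibited: surrogate";
            default:                    return "allowed or mapped";
        }
    }

    public static void main(String[] args) {
        int[] samples = {0x0007, 0xE000, 0xD800, 0xFFFE, 0x0061};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, classify(cp));
        }
    }
}
```

A real profile consults the compiled table data instead, because the tables are frozen at Unicode 3.2 and contain entries that general categories alone do not capture.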
The procedure for preparing strings for use can be described in the following
steps:
1. *Map*: For each code point in the input, check if it has a mapping defined in
the mapping table. If so, replace it with the mapping in the output.
2. *Normalize*: Normalize the output of step 1 using Unicode Normalization Form
NFKC, if the option is set. The normalization algorithm must conform to UAX #15.
3. *Prohibit*: For each code point in the output of step 2, check if the code
point is present in the prohibited table. If so, fail and return an error.
4. *Check BiDi*: Check for code points with strong right-to-left directionality
in the output of step 3. If present, check if the string satisfies the rules
for bidirectional strings as specified.
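
The four steps above can be sketched in plain Java as follows. This is a minimal illustration, not the ICU implementation: the deletion set is a tiny subset of table B.1, `Character.toLowerCase` stands in for the case-folding table, general categories stand in for the prohibited tables, and the JDK normalizer follows a newer Unicode version than 3.2. The class and method names are hypothetical. The BiDi check follows the rules in section 6 of RFC 3454: a string containing a right-to-left character must not contain a left-to-right character, and must begin and end with a right-to-left character.

```java
import java.text.Normalizer;
import java.util.Set;

final class StringPrepSketch {
    // Tiny subset of the "map to nothing" code points; NOT the full table B.1.
    static final Set<Integer> DELETED = Set.of(0x00AD, 0x200B);

    static String prepare(String input, boolean normalize, boolean checkBidi) {
        // Step 1, Map: delete or case-map each code point.
        StringBuilder mapped = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (!DELETED.contains(cp)) {
                // Stand-in for table B.2; real case folding differs for some code points.
                mapped.appendCodePoint(Character.toLowerCase(cp));
            }
        });
        // Step 2, Normalize: apply NFKC only if the profile turns it on.
        String s = normalize
                ? Normalizer.normalize(mapped, Normalizer.Form.NFKC)
                : mapped.toString();
        // Step 3, Prohibit: fail on the first prohibited code point.
        s.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.CONTROL || type == Character.PRIVATE_USE) {
                throw new IllegalArgumentException(
                        String.format("prohibited code point U+%04X", cp));
            }
        });
        // Step 4, Check BiDi: only if the profile turns it on.
        if (checkBidi && !bidiOk(s)) {
            throw new IllegalArgumentException("bidirectional check failed");
        }
        return s;
    }

    static boolean isRandAL(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static boolean bidiOk(String s) {
        int[] cps = s.codePoints().toArray();
        boolean hasRandAL = false;
        boolean hasL = false;
        for (int cp : cps) {
            hasRandAL |= isRandAL(cp);
            hasL |= Character.getDirectionality(cp) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
        }
        if (!hasRandAL) {
            return true; // no right-to-left characters, nothing to check
        }
        return !hasL && isRandAL(cps[0]) && isRandAL(cps[cps.length - 1]);
    }

    public static void main(String[] args) {
        // 'A', SOFT HYPHEN, 'B', FULLWIDTH LATIN CAPITAL LETTER C
        System.out.println(prepare("A\u00ADB\uFF23", true, true)); // abc
    }
}
```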
## NamePrep
NamePrep is a profile of StringPrep for use in IDNA. This profile is defined in
RFC 3491 (<http://www.rfc-editor.org/rfc/rfc3491.txt>).
The profile specifies the following rules:
1. *Map*: Include all code point mappings specified in StringPrep.
2. *Normalize*: Normalize the output of step 1 according to NFKC.
3. *Prohibit*: Prohibit all code points specified as prohibited in StringPrep,
except for the space (U+0020) code point, from the output of step 2.
4. *Check BiDi*: Check for bidirectional code points and process according to
the rules specified in StringPrep.
## Punycode
Punycode is an encoding scheme for Unicode for use in IDNA. Punycode converts
Unicode text to a unique sequence of ASCII text and back to Unicode. It is an
ASCII Compatible Encoding (ACE). Punycode is described in RFC 3492
(<http://www.rfc-editor.org/rfc/rfc3492.txt>).
The Punycode algorithm is an instance of the general Bootstring algorithm, which
allows strings composed of a smaller set of code points to uniquely represent any
string of code points from a larger set. Punycode represents Unicode code points
from U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+007F. The
algorithm can also preserve case information of the code points in the larger set
while encoding and decoding. This feature, however, is not used in IDNA.
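
To make the Bootstring idea concrete, here is a compact Java sketch of the Punycode encoding direction, written from the pseudocode and parameters in RFC 3492. It is an illustration only: it omits the overflow checks the RFC requires, does not implement decoding or case preservation, and does not add the IDNA ACE prefix. The class and method names are hypothetical and this is not ICU code.

```java
final class PunycodeSketch {
    // Bootstring parameters for Punycode, as given in RFC 3492.
    private static final int BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
    private static final int INITIAL_BIAS = 72, INITIAL_N = 128;

    // Bias adaptation function from RFC 3492.
    private static int adapt(int delta, int numPoints, boolean firstTime) {
        delta = firstTime ? delta / DAMP : delta / 2;
        delta += delta / numPoints;
        int k = 0;
        while (delta > ((BASE - TMIN) * TMAX) / 2) {
            delta /= BASE - TMIN;
            k += BASE;
        }
        return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
    }

    // Digit values 0..25 map to 'a'..'z', 26..35 map to '0'..'9'.
    private static char digit(int d) {
        return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
    }

    static String encode(String input) {
        int[] cps = input.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        // Copy the basic (ASCII) code points first, in order.
        for (int cp : cps) {
            if (cp < 0x80) out.append((char) cp);
        }
        int basic = out.length();
        int handled = basic;
        if (basic > 0) out.append('-'); // delimiter between basic part and deltas
        int n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
        while (handled < cps.length) {
            // Find the smallest code point not yet handled.
            int m = Integer.MAX_VALUE;
            for (int cp : cps) {
                if (cp >= n && cp < m) m = cp;
            }
            delta += (m - n) * (handled + 1); // NOTE: no overflow check in this sketch
            n = m;
            for (int cp : cps) {
                if (cp < n) delta++;
                if (cp == n) {
                    // Emit delta as a generalized variable-length integer.
                    int q = delta;
                    for (int k = BASE; ; k += BASE) {
                        int t = k <= bias ? TMIN : (k >= bias + TMAX ? TMAX : k - bias);
                        if (q < t) break;
                        out.append(digit(t + (q - t) % (BASE - t)));
                        q = (q - t) / (BASE - t);
                    }
                    out.append(digit(q));
                    bias = adapt(delta, handled + 1, handled == basic);
                    delta = 0;
                    handled++;
                }
            }
            delta++;
            n++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("b\u00FCcher")); // bcher-kva
    }
}
```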
## Internationalizing Domain Names in Applications (IDNA)
The Domain Name System (DNS) protocol defines the procedure for matching
ASCII strings case insensitively to the names in the lookup tables that map
IP (Internet Protocol) addresses to server names. When Unicode is
used instead of ASCII in server names, two problems arise which need to be
dealt with differently. When the server name is displayed to the user,
Unicode text should be displayed. When Unicode text is stored in lookup tables,
for compatibility with the older DNS protocol and the resolver libraries, the text
should be the ASCII equivalent. The IDNA protocol, defined by RFC 3490
(<http://www.rfc-editor.org/rfc/rfc3490.txt>), satisfies the above
requirements.
Server names stored in the DNS lookup tables are usually formed by concatenating
domain labels with a label separator, for example:
The protocol defines operations to be performed on domain labels before the
names are stored in the lookup tables and before the names fetched from lookup
tables are displayed to the user. The operations are:
1. ToASCII: This operation is performed on domain labels before sending the
name to a resolver and before storing the name in the DNS lookup table. The
domain labels are processed by the StringPrep algorithm using the rules
specified by the NamePrep profile. The output of this step is then encoded
using Punycode and an ACE prefix is added to denote that the text is encoded
using Punycode. IDNA uses the prefix "xn--" before the encoded label.
2. ToUnicode: This operation is performed on domain labels before displaying
the names to users. If the domain label is prefixed with the ACE prefix
for IDNA, then the label excluding the prefix is decoded using Punycode. The
output of the Punycode decoder is verified by applying the ToASCII operation and
comparing the output with the input to the ToUnicode operation.
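
As a quick way to observe the two operations, the sketch below uses the JDK's `java.net.IDN` class, which implements the RFC 3490 ToASCII and ToUnicode operations. This is a deliberate substitution to keep the example dependency-free; it is not the ICU API described later in this document. The wrapper class and method names are hypothetical.

```java
import java.net.IDN;

final class IdnRoundTripSketch {
    /** ToASCII: the form sent to a resolver or stored in DNS lookup tables. */
    static String toWire(String name) {
        return IDN.toASCII(name);
    }

    /** ToUnicode: the form displayed to users. */
    static String toDisplay(String name) {
        return IDN.toUnicode(name);
    }

    public static void main(String[] args) {
        String ascii = toWire("b\u00FCcher.example");
        System.out.println(ascii); // xn--bcher-kva.example
        System.out.println(toDisplay(ascii).equals("b\u00FCcher.example")); // true
    }
}
```

Labels that are already ASCII and carry no ACE prefix pass through both operations unchanged, which is why only the first label in the example is rewritten.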
Unicode contains code points that are visually similar to the ASCII Full Stop
(U+002E). These code points must be treated as label separators when performing
the ToASCII operation. These code points are:
1. Ideographic Full Stop (U+3002)
2. Fullwidth Full Stop (U+FF0E)
3. Halfwidth Ideographic Full Stop (U+FF61)
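
A minimal Java sketch of splitting a name into labels while honoring all four separators follows. The class and method names are hypothetical, and the sketch performs only the split; it does not validate or convert the labels.

```java
final class LabelSplitSketch {
    // ASCII full stop plus the three additional separators listed above.
    private static final String SEPARATORS = "[.\u3002\uFF0E\uFF61]";

    /** Splits a domain name into labels; empty labels are preserved. */
    static String[] labels(String name) {
        return name.split(SEPARATORS, -1);
    }

    public static void main(String[] args) {
        // "www" IDEOGRAPHIC FULL STOP "example" FULLWIDTH FULL STOP "com"
        for (String label : labels("www\u3002example\uFF0Ecom")) {
            System.out.println(label);
        }
    }
}
```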
Unassigned code points in Unicode Version 3.2, as given in the StringPrep tables,
are treated differently depending on how the processed string is used. For query
operations, where a registrar is asked for information about the
availability of a certain domain name, unassigned code points are allowed to be
present in the string. For storing the string in DNS lookup tables, unassigned
code points are prohibited from the input.
IDNA specifies that the ToUnicode and ToASCII operations have options to check for
Letter-Digit-Hyphen code points and adhere to the STD3 ASCII Rules.
IDNA specifies that domain labels are equivalent if and only if the outputs of
the ToASCII operation on the labels match using case insensitive ASCII comparison.
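
The STD3 option and the equivalence rule can be tried out with the JDK's `java.net.IDN` class, used here instead of ICU only to keep the sketch free of dependencies. `IDN.USE_STD3_ASCII_RULES` is a real JDK flag (a companion flag, `IDN.ALLOW_UNASSIGNED`, covers the query case described above). The wrapper class and method names are hypothetical.

```java
import java.net.IDN;

final class IdnEquivalenceSketch {
    /** Names are equivalent if their ToASCII forms match, ignoring ASCII case. */
    static boolean sameName(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    /** Returns false if ToASCII rejects the name under the STD3 ASCII rules. */
    static boolean passesStd3(String name) {
        try {
            IDN.toASCII(name, IDN.USE_STD3_ASCII_RULES);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // for example, a non-LDH ASCII character such as '_'
        }
    }

    public static void main(String[] args) {
        System.out.println(sameName("B\u00DCCHER.example", "b\u00FCcher.EXAMPLE")); // true
        System.out.println(passesStd3("a_b.example")); // false
    }
}
```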
## StringPrep Service in ICU
The StringPrep service in ICU is data driven. The service is based on the
Open-Use-Close pattern. A StringPrep profile is opened, the strings are
processed according to the rules specified in the profile, and the profile is
closed when it is no longer needed.
Tools for filtering RFC 3454 and producing a rule file that can be compiled into
a binary format containing all the information required by the service are
provided.
The procedure for producing a StringPrep profile data file is as follows:
1. Run the `filterRFC3454.pl` Perl tool to filter the RFC file and produce a rule
file. Clients can edit the resulting text file to add or delete
mappings or prohibited code points.
2. Run the `gensprep` tool to compile the rule file into a binary format. The
options to turn on normalization of strings and checking of bidirectional
code points are passed as command line options to the tool. This tool
produces a binary profile file with the extension "spp".
3. Open the StringPrep profile, passing the path to the binary and the name of
the binary profile file to the open call. The profile data files are
memory mapped and cached for optimum performance.
### Code Snippets
> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
> keep the profile object around for reuse, instead of opening and closing the
> profile each time.
#### C++

    /* requires #include "unicode/usprep.h" */
    UErrorCode status = U_ZERO_ERROR;
    UParseError parseError;
    /* open the StringPrep profile */
    UStringPrepProfile* profile = usprep_open("/usr/joe/mydata",
                                              "nfscsi", &status);
    if(U_FAILURE(status)) {
        /* handle the error; the profile must not be used */
    }
    /* prepare the string for use according
     * to the rules specified in the profile;
     * src, srcLength, dest and destCapacity are supplied by the caller
     */
    int32_t retLen = usprep_prepare(profile, src, srcLength, dest,
                                    destCapacity, USPREP_ALLOW_UNASSIGNED,
                                    &parseError, &status);
    /* close the profile */
    usprep_close(profile);
#### Java

    // singleton instance
    private static final NFSCSIStringPrep prep = new NFSCSIStringPrep();
    // assigned in the constructor, so it cannot be a static final field
    private StringPrep nfscsi = null;
    private NFSCSIStringPrep() {
        try {
            InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp");
            nfscsi = new StringPrep(nfscsiFile);
            nfscsiFile.close();
        } catch(IOException e) {
            throw new RuntimeException(e.toString());
        }
    }
    private static byte[] prepare(byte[] src, StringPrep prep)
            throws StringPrepParseException, UnsupportedEncodingException {
        String s = new String(src, "UTF-8");
        UCharacterIterator iter = UCharacterIterator.getInstance(s);
        StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
        return out.toString().getBytes("UTF-8");
    }
## IDNA API in ICU
ICU provides APIs for performing the ToASCII, ToUnicode and compare operations
as defined by RFC 3490. Convenience methods for comparing IDNs are also
provided. These APIs follow ICU policies for string manipulation and coding
guidelines.
### Code Snippets
> :point_right: **Note**: The code snippets demonstrate the usage of the APIs. Applications should
> keep the profile object around for reuse, instead of opening and closing the
> profile each time.
### ToASCII operation
***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                            UIDNA_DEFAULT, &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity = destLen + 1; /* for the terminating NUL */
        free(dest); /* free the undersized buffer */
        /* allocate the full new capacity, including the terminator */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                                UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output */

***Java***

    try {
        StringBuffer out = IDNA.convertToASCII(inBuf, IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        /* handle the exception */
    }
### ToUnicode operation
***C***

    UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                              UIDNA_DEFAULT,
                              &parseError, &status);
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        status = U_ZERO_ERROR;
        destCapacity = destLen + 1; /* for the terminating NUL */
        /* free the undersized buffer */
        free(dest);
        /* allocate the full new capacity, including the terminator */
        dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
        destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                                  UIDNA_DEFAULT, &parseError, &status);
    }
    if(U_FAILURE(status)) {
        /* handle the error */
    }
    /* do interesting stuff with output */

***Java***

    try {
        StringBuffer out = IDNA.convertToUnicode(inBuf, IDNA.DEFAULT);
    } catch(StringPrepParseException ex) {
        // handle the exception
    }
### compare operation
***C***

    int32_t rc = uidna_compare(source1, length1,
                               source2, length2,
                               UIDNA_DEFAULT,
                               &status);
    if(U_FAILURE(status)) {
        /* handle the error; rc is not meaningful */
    } else if(rc == 0) {
        /* the IDNs are the same ... do something interesting */
    } else {
        /* the IDNs are different ... do something */
    }

***Java***

    try {
        int retVal = IDNA.compare(s1, s2, IDNA.DEFAULT);
        // do something interesting with retVal
    } catch(StringPrepParseException e) {
        // handle the exception
    }
## Design Considerations
StringPrep profiles exhibit the following characteristics:
1. The profiles contain information about code points. StringPrep allows
profiles to add/delete code points or mappings.
2. Options such as turning normalization and checking for bidirectional code
points on or off are the properties of the profiles.
3. The StringPrep algorithm is not overridden by the profile.
4. Once defined, the profiles do not change.
The StringPrep profiles are used in network protocols, so runtime performance is
important.
Many profiles have been and are being defined, so applications should be able to
plug in arbitrary profiles and get the desired result out of the framework.
ICU is designed for this usage by providing build-time tools for arbitrary
StringPrep profile definitions, and loading them from application-supplied data
in binary form with data structures optimized for runtime use.
## Demo
A web application at <https://icu4c-demos.unicode.org/icu-bin/idnbrowser>
illustrates the use of IDNA API. The source code for the application is
available at <https://github.com/unicode-org/icu-demos/tree/main/idnbrowser>.
## Appendix
#### NFS Version 4 Profiles
Network File System Version 4, defined by RFC 3530
(<http://www.rfc-editor.org/rfc/rfc3530.txt>), defines the use of Unicode text in
the protocol. ICU provides the requisite profiles as part of the test suite, and
code for processing the strings according to the profiles as part of the samples.
The RFC defines three profiles:
1. *nfs4_cs_prep Profile*: This profile is used for preparing file and path
name strings. Normalization of code points and checking for bidirectional
code points are turned off. Case mappings are included if the NFS
implementation supports case insensitive file and path names.
2. *nfs4_cis_prep Profile*: This profile is used for preparing NFS server
names. Normalization of code points and checking for bidirectional code
points are turned on. This profile is equivalent to NamePrep profile.
3. *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the
Access Control Entries of NFS servers. These strings consist of two parts,
prefix and suffix, separated by '@' (U+0040). The prefix is processed with
case mappings turned off and the suffix is processed with case mappings
turned on. Normalization of code points and checking for bidirectional code
points are turned on.
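
The prefix and suffix handling of the mixed profile can be sketched in Java as follows. This is an illustration under assumptions, not the ICU sample code: `toLowerCase` stands in for the StringPrep case-mapping table, the prohibit and BiDi steps are omitted, the string is split at the first '@', and rejecting input without an '@' is a choice made for this sketch. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class MixedPrepSketch {
    static String prepare(String input) {
        int at = input.indexOf('@'); // assumption: split at the first '@'
        if (at < 0) {
            // Assumption of this sketch; the behavior for missing '@' is not specified here.
            throw new IllegalArgumentException("missing '@' separator");
        }
        // Prefix: normalization on, case mappings off.
        String prefix = Normalizer.normalize(input.substring(0, at), Normalizer.Form.NFKC);
        // Suffix: case mappings on (lowercasing as a stand-in), then normalization.
        String suffix = Normalizer.normalize(
                input.substring(at + 1).toLowerCase(Locale.ROOT), Normalizer.Form.NFKC);
        return prefix + "@" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(prepare("Joe@EXAMPLE.COM")); // Joe@example.com
    }
}
```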
#### XMPP Profiles
Extensible Messaging and Presence Protocol (XMPP) is an XML based protocol for
near real-time extensible messaging and presence. This protocol defines use of
two StringPrep profiles:
1. *ResourcePrep Profile*: This profile is used for processing the resource
identifiers within XMPP. Normalization of code points and checking of
bidirectional code points are turned on. Case mappings are excluded. The
space code point (U+0020) is excluded from the prohibited code points set.
2. *NodePrep Profile*: This profile is used for processing the node identifiers
within XMPP. Normalization of code points and checking of bidirectional code
points are turned on. Case mappings are included. All code points specified
as prohibited in StringPrep are prohibited. Additional code points are added
to the prohibited set.