docs/userguide/conversion/compression.md - external/github.com/unicode-org/icu - Git at Google

 ---
 layout: default
 title: Compression
 nav_order: 4
 parent: Conversion
 ---
 <!--
 © 2020 and later: Unicode, Inc. and others.
 License & terms of use: http://www.unicode.org/copyright.html
 -->

 # Compression
 {: .no_toc }

 ## Contents
 {: .no_toc .text-delta }

 1. TOC
 {:toc}

 ---

 ## Overview of SCSU

 Compressing Unicode text for transmission or storage results in minimal
 bandwidth usage and fewer storage devices. The compression scheme compresses
 Unicode text into a sequence of bytes by using characteristics of Unicode text.
 The compressed sequence can be used on its own or as further input to a general
 purpose file or disk-block based compression scheme. Note that the combination
 of the Unicode compression algorithm plus disk-block based compression produces
 better results than either method alone.

 Strings in languages using small alphabets contain runs of characters that are
 coded close together in Unicode. These runs are typically interrupted only by
 punctuation characters, which are themselves coded in proximity to each other in
 Unicode (usually in the Basic Latin range).

 For additional detail about the compression algorithm, which has been approved
 by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
 Standard Compression Scheme for
 Unicode)](https://www.unicode.org/reports/tr6/).

 The Standard Compression Scheme for Unicode (SCSU) is used to:

 *   express all code points in Unicode

 *   approximate the storage size of traditional character sets

 *   facilitate the use of short strings

 *   provide transparency for characters between `U+0020`-`U+00FF`, as well as `CR`, `LF`
     and `TAB`

 *   support very simple decoders

 *   support simple as well as sophisticated encoders

 It does not attempt to avoid the use of control bytes (including `NUL`) in the
 compressed stream.

 The compression scheme is mainly intended for use with short to medium length
 Unicode strings. The resulting compressed format is intended for storage or
 transmission in bandwidth limited environments. It can be used stand-alone or as
 input to traditional general purpose data compression schemes. It is not
 intended as processing format or as general purpose interchange format.

 ## BOCU-1

 A MIME compatible encoding called BOCU-1 is also available in ICU. Details about
 this encoding can be found in the [Unicode Technical Note
 #6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are IANA
 registered names.

 ## Usage

 The compression service in ICU is a part of Conversion framework, and follows
 the semantics of converters. For more information on how to use ICU's conversion
 service, please refer to the Usage Model section in the [Using
 Converters](converters.md) chapter.

 ```c++
 uint16_t germanUTF16[]={
     0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
 };

 uint8_t germanSCSU[]={
     0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
 };
 char target[100];
 UChar uTarget[100];
 UErrorCode status = U_ZERO_ERROR;
 UConverter *conv;
 int32_t len;

 /* set up the SCSU converter */
 conv = ucnv_open("SCSU", &status);
 assert(U_SUCCESS(status));

 /* compress the string using SCSU */
 len = ucnv_fromUChars(conv, target, 100, germanUTF16, -1, &status);
 assert(U_SUCCESS(status));

 len = ucnv_toUChars(conv, uTarget, 100, germanSCSU, -1, &status);

 /* close the converter */
 ucnv_close(conv);
 ```
	---
	layout: default
	title: Compression
	nav_order: 4
	parent: Conversion
	---
	<!--
	© 2020 and later: Unicode, Inc. and others.
	License & terms of use: http://www.unicode.org/copyright.html
	-->

	# Compression
	{: .no_toc }

	## Contents
	{: .no_toc .text-delta }

	1. TOC
	{:toc}

	---

	## Overview of SCSU

	Compressing Unicode text for transmission or storage results in minimal
	bandwidth usage and fewer storage devices. The compression scheme compresses
	Unicode text into a sequence of bytes by using characteristics of Unicode text.
	The compressed sequence can be used on its own or as further input to a general
	purpose file or disk-block based compression scheme. Note that the combination
	of the Unicode compression algorithm plus disk-block based compression produces
	better results than either method alone.

	Strings in languages using small alphabets contain runs of characters that are
	coded close together in Unicode. These runs are typically interrupted only by
	punctuation characters, which are themselves coded in proximity to each other in
	Unicode (usually in the Basic Latin range).

	For additional detail about the compression algorithm, which has been approved
	by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
	Standard Compression Scheme for
	Unicode)](https://www.unicode.org/reports/tr6/).

	The Standard Compression Scheme for Unicode (SCSU) is used to:

	* express all code points in Unicode

	* approximate the storage size of traditional character sets

	* facilitate the use of short strings

	* provide transparency for characters between `U+0020`-`U+00FF`, as well as `CR`, `LF`
	and `TAB`

	* support very simple decoders

	* support simple as well as sophisticated encoders

	It does not attempt to avoid the use of control bytes (including `NUL`) in the
	compressed stream.

	The compression scheme is mainly intended for use with short to medium length
	Unicode strings. The resulting compressed format is intended for storage or
	transmission in bandwidth limited environments. It can be used stand-alone or as
	input to traditional general purpose data compression schemes. It is not
	intended as processing format or as general purpose interchange format.

	## BOCU-1

	A MIME compatible encoding called BOCU-1 is also available in ICU. Details about
	this encoding can be found in the [Unicode Technical Note
	#6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are IANA
	registered names.

	## Usage

	The compression service in ICU is a part of Conversion framework, and follows
	the semantics of converters. For more information on how to use ICU's conversion
	service, please refer to the Usage Model section in the [Using
	Converters](converters.md) chapter.

	```c++
	uint16_t germanUTF16[]={
	0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
	};

	uint8_t germanSCSU[]={
	0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
	};
	char target[100];
	UChar uTarget[100];
	UErrorCode status = U_ZERO_ERROR;
	UConverter *conv;
	int32_t len;

	/* set up the SCSU converter */
	conv = ucnv_open("SCSU", &status);
	assert(U_SUCCESS(status));

	/* compress the string using SCSU */
	len = ucnv_fromUChars(conv, target, 100, germanUTF16, -1, &status);
	assert(U_SUCCESS(status));

	len = ucnv_toUChars(conv, uTarget, 100, germanSCSU, -1, &status);

	/* close the converter */
	ucnv_close(conv);
	```