blob: 9bf18a08ad6420e47afa6ccf0c352c5cb9cfe98d [file] [log] [blame] [view]
---
layout: default
title: Compression
nav_order: 4
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Compression
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Overview of SCSU
Compressing Unicode text for transmission or storage results in minimal
bandwidth usage and fewer storage devices. The compression scheme compresses
Unicode text into a sequence of bytes by using characteristics of Unicode text.
The compressed sequence can be used on its own or as further input to a general
purpose file or disk-block based compression scheme. Note that the combination
of the Unicode compression algorithm plus disk-block based compression produces
better results than either method alone.
Strings in languages using small alphabets contain runs of characters that are
coded close together in Unicode. These runs are typically interrupted only by
punctuation characters, which are themselves coded in proximity to each other in
Unicode (usually in the Basic Latin range).
For additional detail about the compression algorithm, which has been approved
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
Standard Compression Scheme for
Unicode)](https://www.unicode.org/reports/tr6/).
The Standard Compression Scheme for Unicode (SCSU) is used to:
* express all code points in Unicode
* approximate the storage size of traditional character sets
* facilitate the use of short strings
* provide transparency for characters between `U+0020`-`U+00FF`, as well as `CR`, `LF`
and `TAB`
* support very simple decoders
* support simple as well as sophisticated encoders
It does not attempt to avoid the use of control bytes (including `NUL`) in the
compressed stream.
The compression scheme is mainly intended for use with short to medium length
Unicode strings. The resulting compressed format is intended for storage or
transmission in bandwidth limited environments. It can be used stand-alone or as
input to traditional general purpose data compression schemes. It is not
intended as processing format or as general purpose interchange format.
## BOCU-1
A MIME compatible encoding called BOCU-1 is also available in ICU. Details about
this encoding can be found in the [Unicode Technical Note
#6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are IANA
registered names.
## Usage
The compression service in ICU is a part of Conversion framework, and follows
the semantics of converters. For more information on how to use ICU's conversion
service, please refer to the Usage Model section in the [Using
Converters](converters.md) chapter.
```c++
uint16_t germanUTF16[]={
0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
};
uint8_t germanSCSU[]={
0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
};
char target[100];
UChar uTarget[100];
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
int32_t len;
/* set up the SCSU converter */
conv = ucnv_open("SCSU", &status);
assert(U_SUCCESS(status));
/* compress the string using SCSU */
len = ucnv_fromUChars(conv, target, 100, germanUTF16, -1, &status);
assert(U_SUCCESS(status));
len = ucnv_toUChars(conv, uTarget, 100, germanSCSU, -1, &status);
/* close the converter */
ucnv_close(conv);
```