 Unicode 4.1 update

 *** related Jitterbugs

 4332 RFE: Update to Unicode 4.1
 4157 RBBI, TR29 4.1 updates

 *** data files & enums & parser code

 * file preparation
 - ucdstrip:
     DerivedCoreProperties.txt
     DerivedNormalizationProps.txt
     NormalizationTest.txt
     GraphemeBreakProperty.txt
     SentenceBreakProperty.txt
     WordBreakProperty.txt
 - ucdstrip and ucdmerge:
     EastAsianWidth.txt
     LineBreak.txt

 * add new files to the repository
     GraphemeBreakProperty.txt
     SentenceBreakProperty.txt
     WordBreakProperty.txt

 * update FractionalUCA.txt and UCARules.txt with new canonical closure

 * genpname
 - handle new enumerated properties in sub read_uchar
 - run preparse.pl

 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
 - new binary properties
   + Pattern_Syntax
   + Pattern_White_Space
 - new enumerated properties
   + Grapheme_Cluster_Break
   + Sentence_Break
   + Word_Break
 - new block & script & line break values

 * gencase
 - case-ignorable changes
   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
   now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
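
 The D47a condition above can be read as a simple predicate. The following is
 a minimal Python sketch, not the gencase implementation. It assumes the
 general categories from Python's unicodedata module (which tracks a newer
 Unicode version than 4.1) and a hypothetical, deliberately incomplete
 SAMPLE_MID_LETTER set; the real Word_Break=MidLetter data comes from
 WordBreakProperty.txt.

```python
import unicodedata

# General categories named in the D47a definition.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Hypothetical, incomplete sample of Word_Break=MidLetter code points
# (MIDDLE DOT, HYPHENATION POINT). Load WordBreakProperty.txt for real data.
SAMPLE_MID_LETTER = frozenset({0x00B7, 0x2027})


def is_case_ignorable(ch, mid_letter=SAMPLE_MID_LETTER):
    """Return True if ch is Word_Break=MidLetter or has gc Mn, Me, Cf, Lm or Sk."""
    if ord(ch) in mid_letter:
        return True
    return unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


if __name__ == "__main__":
    for ch in ("\u0301", "\u00b7", "\u00ad", "a"):
        print("U+%04X" % ord(ch), is_case_ignorable(ch))
```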

 *** Unicode version numbers
 - makedata.mak
 - uchar.h
 - configure.in

 *** tests
 - verify that u_charMirror() round-trips
 - test all new properties and some new values of old properties
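
 The round-trip check for u_charMirror() amounts to verifying that mirroring
 a character twice returns the original. A rough Python sketch follows; it
 does not call ICU. The names char_mirror, round_trip_failures and
 SAMPLE_MIRRORS are hypothetical stand-ins, and the mapping is a tiny sample
 rather than the real mirroring data.

```python
# Hypothetical sample of mirror pairs; the real data has many more entries.
SAMPLE_MIRRORS = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def char_mirror(ch, mirrors=SAMPLE_MIRRORS):
    """Stand-in for u_charMirror(): return the mirror, or ch itself if none."""
    return mirrors.get(ch, ch)


def round_trip_failures(chars, mirrors=SAMPLE_MIRRORS):
    """Return the characters for which mirror(mirror(c)) != c."""
    return [
        c for c in chars
        if char_mirror(char_mirror(c, mirrors), mirrors) != c
    ]


if __name__ == "__main__":
    # A one-directional mapping is the kind of data error this check catches.
    broken = {"(": ")"}
    print(round_trip_failures("()[]<>ab"))
    print(round_trip_failures("()", broken))
```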

 *** other code

 * hardcoded Unihan range end/limit
 - Unihan range end moves from 9FA5 to 9FBB
   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
   + do not modify BOCU/BOCSU code because that would change the encoding
     and break binary compatibility!
   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
     NamePrepProfile.txt
   + ignore trietest.c: test data is arbitrary
   + ignore tstnorm.cpp: test optimization, not important
   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
   + do change line_th.txt and word_th.txt
     by replacing hardcoded ranges with the new property values
   + do change gennames.c

 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
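
 The search described above (regex 9FA[56], case-insensitive) can be
 reproduced with a short script. This is a sketch only: the function name
 find_hardcoded_unihan is hypothetical, and the sample lines are made up
 apart from the gennames.c fragment quoted above.

```python
import re

# Matches the old Unihan range end (9FA5) and limit (9FA6) in any letter case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_unihan(lines):
    """Return (line_number, text) pairs for lines mentioning 9FA5 or 9FA6."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]


if __name__ == "__main__":
    sample = [
        "    0x4e00, 0x9fa5,\n",   # old range end: reported
        "    0x4e00, 0x9fbb,\n",   # new range end: not reported
        "if (c < 0x9FA6) {\n",     # old limit, upper case: reported
    ]
    for number, text in find_hardcoded_unihan(sample):
        print("%d: %s" % (number, text))
```

 Each hit still needs a manual decision, since several of the matches listed
 above must deliberately stay unchanged.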

 * case mappings
 - compare new special casing context conditions with previous ones
   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

 * genpname
 - consider storing only the short name if it is the same as the long name

 *** other reviews
 - UAX #29 changes (grapheme/word/sentence breaks)
 - UAX #14 changes (line breaks)
 - Pattern_Syntax & Pattern_White_Space

 ---------------------------------------------------------------------------- ***

 Unicode 4.0.1 update

 *** related Jitterbugs

 3170 RFE: Update to Unicode 4.0.1
 3171 Add new Unicode 4.0.1 properties
 3520 use Unicode 4.0.1 updates for break iteration

 *** data files & enums & parser code

 * file preparation
 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

 * file fixes
 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
   according to PRI #26
   http://www.unicode.org/review/resolved-pri.html#pri26
 - this fix was reverted because no corrigendum is in sight;
   instead, the tests were modified to skip this consistency check for Unicode 4.0.1

 * ucdterms.txt
 - update from http://www.unicode.org/copyright.html
   formatted for plain text

 * uchar.h & uprops.h & uprops.c & genprops
 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
 - add U_LB_INSEPARABLE due to a spelling fix
   + put short name comment only on line with new constant
     for genpname perl script parser
 - new binary properties
   + STerm
   + Variation_Selector

 * genpname
 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
 - perl script: correctly calculate the maximum number of fields per row

 * uscript.h
 - new script code Hrkt=Katakana_Or_Hiragana

 * gennorm.c track changes in DerivedNormalizationProps.txt
 - "FNC" -> "FC_NFKC"
 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
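
 To illustrate the two format changes, here is a Python sketch of a parser
 that accepts both the old and the new field layout and normalizes to the new
 one. It is not the gennorm.c code. The function name parse_norm_prop is
 hypothetical, the sample lines are illustrative, and the handling of a
 "_MAYBE" suffix (mapped to value "M") is an assumption covering the "etc."
 above.

```python
def parse_norm_prop(line):
    """Parse one data line into (start, end, property, value), or None.

    Old-style names are normalized: "FNC" becomes "FC_NFKC", and a single
    field such as "NFD_NO" becomes property "NFD_QC" with value "N".
    """
    data = line.split("#", 1)[0].strip()   # drop trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return None
    code_range, prop = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None

    if prop == "FNC":
        prop = "FC_NFKC"
    elif value is None and prop.endswith("_NO"):
        prop, value = prop[:-len("_NO")] + "_QC", "N"
    elif value is None and prop.endswith("_MAYBE"):   # assumed old spelling
        prop, value = prop[:-len("_MAYBE")] + "_QC", "M"

    start, _, end = code_range.partition("..")
    return int(start, 16), int(end or start, 16), prop, value


if __name__ == "__main__":
    old_style = "00C0..00C5 ; NFD_NO # old single-field form"
    new_style = "00C0..00C5 ; NFD_QC; N # new two-field form"
    print(parse_norm_prop(old_style) == parse_norm_prop(new_style))
```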

 * genprops/props2.c track changes in DerivedNumericValues.txt
 - changed from 3 columns to 2, dropping the numeric type
   + assume that the type is always numeric for Han characters,
     and that only those are added in addition to what UnicodeData.txt lists

 *** Unicode version numbers
 - makedata.mak
 - uchar.h
 - configure.in

 *** tests
 - update test of default bidi classes according to PRI #28
   /tsutil/cucdtst/TestUnicodeData
   http://www.unicode.org/review/resolved-pri.html#pri28
 - bidi tests: change exemplar character for ES depending on Unicode version
 - change hardcoded expected property values where they change

 *** other code

 * name matching
 - read UCD.html

 * scripts
 - use new Hrkt=Katakana_Or_Hiragana

 * ZWJ & ZWNJ
 - are now part of combining character sequences
 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ