Text Normalization / 秋梦无痕

From: Core Java Technologies Tech Tips for February 2007

Text Normalization
by Sergey Groznyh

Text normalization is a text transformation that makes the text consistent with some pre-defined rules. Examples include white-space stripping, punctuation removal, uppercase/lowercase translation, and so on. This tech tip discusses one important form of text normalization, Unicode text normalization. Unless otherwise specified, the term Unicode in this article refers to Unicode 4.0 because this is the Unicode version supported by the Java SE 6 platform.

The Unicode standard defines two kinds of equivalence between characters and sequences of characters: canonical equivalence and compatibility equivalence. One example of canonical equivalence is a precomposed character and its equivalent combining sequence. For example, the Unicode character 'Ç' (LATIN CAPITAL LETTER C WITH CEDILLA) has the Unicode character value U+00C7. The Unicode character sequence U+0043 U+0327 also produces the 'Ç' character. The sequence contains the character values for LATIN CAPITAL LETTER C followed by COMBINING CEDILLA. The single character and the character sequence are canonically equivalent because they are visually indistinguishable and mean exactly the same thing for the purposes of text comparison and rendering.
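As a minimal sketch of what canonical equivalence means in code, the two representations of 'Ç' described above differ as Java strings but match once both are brought to the same normalization form. The class name CanonicalEquivalenceDemo and the helper canonicallyEqual are illustrative names chosen for this sketch, not part of any API.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class CanonicalEquivalenceDemo {
    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Form.NFC)
                .equals(Normalizer.normalize(b, Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C7";  // LATIN CAPITAL LETTER C WITH CEDILLA
        String combining = "C\u0327";   // LATIN CAPITAL LETTER C + COMBINING CEDILLA

        // The raw strings hold different char sequences.
        System.out.println(precomposed.equals(combining));             // false
        // After normalization they are identical.
        System.out.println(canonicallyEqual(precomposed, combining));  // true
    }
}
```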

Compatibility equivalence, on the other hand, deals mostly with legacy character sets that define alternate visual representations of the same character or character sequence. An example of compatibility equivalence is the equivalence between the DIGIT TWO character '2' (U+0032) and the SUPERSCRIPT TWO character '²' (U+00B2). Both characters exist in the character set ISO/IEC 8859-1 (Latin1). The DIGIT TWO character '2' and the SUPERSCRIPT TWO character '²' are compatibility equivalent because they are variants of the same basic character. Because the characters are visually distinguishable and the superscript carries additional semantic information, the characters are not canonically equivalent.
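The difference can be observed with a short sketch: a canonical form leaves SUPERSCRIPT TWO alone, while a compatibility form folds it to DIGIT TWO. The class and method names here are made up for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class CompatibilityEquivalenceDemo {
    // Canonical normalization (composed form).
    static String canonical(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    // Compatibility normalization (composed form).
    static String compatibility(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        String superscriptTwo = "\u00B2";  // SUPERSCRIPT TWO

        // Not canonically equivalent to DIGIT TWO, so NFC keeps it as is.
        System.out.println(canonical(superscriptTwo).equals(superscriptTwo));  // true
        // Compatibility equivalent to DIGIT TWO, so NFKC folds it,
        // dropping the "superscript" information.
        System.out.println(compatibility(superscriptTwo).equals("2"));        // true
    }
}
```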

Unicode text normalization is the process of translating characters and character sequences from one equivalent form into another. Unicode defines four normalization forms.

NFC
Normalization Form Canonical Composition. Characters are decomposed and then recomposed by canonical equivalence. For example, sequences like "letter+combining marks" are composed to form a single character if possible.
NFD
Normalization Form Canonical Decomposition. Characters are decomposed by canonical equivalence. For example, the precomposed character 'Ç' (U+00C7) transforms to a combining sequence containing a base character and a combining accent.
NFKC
Normalization Form Compatibility Composition. Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence.
NFKD
Normalization Form Compatibility Decomposition. Characters are decomposed by compatibility equivalence. For example, the fraction '½' (U+00BD) transforms into a sequence of three characters: '1' (U+0031), FRACTION SLASH (U+2044), and '2' (U+0032).
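To compare the four forms side by side, the following sketch normalizes a two-character string and prints the resulting code points. The class FormsDemo and its helpers codePoints and describe are hypothetical names introduced only for this example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class FormsDemo {
    // Hypothetical helper: renders a string as space-separated code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Normalizes the input to the given form and lists the code points.
    static String describe(String s, Form form) {
        return codePoints(Normalizer.normalize(s, form));
    }

    public static void main(String[] args) {
        // C WITH CEDILLA followed by VULGAR FRACTION ONE HALF
        String input = "\u00C7\u00BD";
        Form[] forms = { Form.NFC, Form.NFD, Form.NFKC, Form.NFKD };
        for (Form form : forms) {
            System.out.println(form + ": " + describe(input, form));
        }
    }
}
```

The canonical forms leave U+00BD untouched and differ only in how the cedilla is represented, while the compatibility forms also expand the fraction to U+0031 U+2044 U+0032.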

The Normalizer class

Java SE 6 supports Unicode text normalization by providing the now public class java.text.Normalizer. This class defines both the normalize method that transforms text and the Form enumeration that represents the Unicode normalization forms NFC, NFD, NFKC, and NFKD.

Possible applications of various Unicode normalization forms are shown below:

Example: NFC

Suppose you want to publish a document on the Web. The Character Model for the World Wide Web specification recommends that, in order to improve indexing, searching, and other text-related functionality of the Web, data should be normalized before publishing (early normalization). The specification states that NFC is preferred because almost all legacy data, as well as data created by current software, is already normalized to NFC. The following code reads data from standard input and writes NFC-normalized data to standard output. The UTF-8 encoding is used for both input and output.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFC {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final String OUTPUT_ENC = "UTF-8";
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            PrintWriter w = new PrintWriter(
                new OutputStreamWriter(System.out, OUTPUT_ENC), true);
            String s;
            while ((s = r.readLine()) != null) {
                w.println(Normalizer.normalize(s, Form.NFC));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

NFC normalization is also well suited for string equality tests. Note, however, that for locale-sensitive comparisons such as sorting, the java.text.Collator class, initialized with the appropriate locale, should be used. The reason for using the Collator class is that the sorting order for accented letters differs between languages. For sorting purposes, some languages place accented letters right after the base letter, and some place accented letters after all base letters.
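The following sketch contrasts plain String ordering with Collator ordering. The class CollatorSortDemo, the helper sortFor, and the sample words are assumptions made for illustration; the choice of Locale.FRENCH is arbitrary.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {
    // Sorts a copy of the input using the collation rules of the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        names.add("zebra");
        names.add("\u00C9mile");  // starts with E WITH ACUTE
        names.add("apple");

        // String.compareTo orders by char value, so the accented name
        // (first char U+00C9) lands after "zebra".
        List<String> byCharValue = new ArrayList<String>(names);
        Collections.sort(byCharValue);
        System.out.println(byCharValue);

        // The collator treats the accented E as a variant of 'e',
        // so the accented name lands between "apple" and "zebra".
        System.out.println(sortFor(Locale.FRENCH, names));
    }
}
```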

Example: NFD

Suppose you are developing a phone directory application. You store the directory data in a database and provide a search form to look up the data. Because people's names around the world contain accented characters, you have two problems: many databases handle accented characters poorly, and many users of your application will not bother to enter, or simply cannot enter, the correct (accented) names into the search form. So you must remove all accents from both the data stored in the database and the data read from the search form.

The following code reads standard input line by line, strips the accents from the characters in each line, and writes the result to standard output. The UTF-8 encoding is used for both input and output.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFD {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final String OUTPUT_ENC = "UTF-8";
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            PrintWriter w = new PrintWriter(
                new OutputStreamWriter(System.out, OUTPUT_ENC), true);
            String s;
            while ((s = r.readLine()) != null) {
                // decompose and remove accents
                String decomposed = Normalizer.normalize(s, Form.NFD);
                String accentsGone =
                    decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
                w.println(accentsGone);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
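For the directory scenario, the same decomposition step could be wrapped in a reusable helper and applied to both the stored names and the search input. The sketch below is one possible shape, not the article's method: the class AccentFolding and the methods stripAccents and searchKey are hypothetical, and it swaps in the broader \p{M} pattern (any combining mark) in place of the InCombiningDiacriticalMarks block used above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

class AccentFolding {
    // Matches runs of Unicode combining marks (general category M).
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Decomposes the text to NFD, then drops the combining marks.
    static String stripAccents(String s) {
        return MARKS.matcher(Normalizer.normalize(s, Form.NFD)).replaceAll("");
    }

    // Hypothetical search key: accent-free and lower-cased.
    static String searchKey(String s) {
        return stripAccents(s).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String stored = "Fran\u00E7ois M\u00FCller";  // c WITH CEDILLA, u WITH DIAERESIS
        System.out.println(stripAccents(stored));    // Francois Muller

        // A user typing without accents still matches the stored name.
        System.out.println(searchKey("francois muller").equals(searchKey(stored)));  // true
    }
}
```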

Example: NFKC

NFKC normalization affects characters with combining marks that have a compatibility decomposition form. So, the character sequence U+1E9B U+0323 (LATIN SMALL LETTER LONG S WITH DOT ABOVE followed by COMBINING DOT BELOW) is transformed to the single character value U+1E69. The normalized character is 'ṩ' (LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE).
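The transformation described above can be checked with a few lines of code. This is a small sketch; the class name NfkcDemo and its two wrapper methods are illustrative only.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NfkcDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER LONG S WITH DOT ABOVE + COMBINING DOT BELOW
        String input = "\u1E9B\u0323";

        // NFC keeps the long s, so the two-character sequence remains.
        System.out.println(nfc(input).equals(input));      // true
        // NFKC replaces the long s with a plain 's' and then composes
        // both dots into a single precomposed character.
        System.out.println(nfkc(input).equals("\u1E69"));  // true
    }
}
```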

This normalization form is required to comply with the string profile specification for Internationalized Domain Names (RFC 3491). If a domain name contains non-ASCII characters, it must be normalized to the NFKC form. So if you are building an application that registers internationalized domain names, you must normalize the names to this form.

Note that the Internationalized Domain Names specification is based on Unicode version 3.2, and the normalization of some CJK ideographic characters differs between Unicode versions 3.2 and 4.0. If you are not implementing RFC 3491 yourself and just want to get the normalized domain name, you can use the facilities provided by the java.net.IDN class.
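As a brief sketch of the java.net.IDN route, the code below converts a domain name containing a non-ASCII character to its ASCII-compatible form and back. The class name IdnDemo, the wrapper methods, and the sample name under the reserved .example domain are assumptions for illustration. The decomposed spelling is included on the assumption that the normalization step inside the conversion maps it to the same result; that line is printed rather than asserted.

```java
import java.net.IDN;

class IdnDemo {
    // Thin wrappers so the conversions can be reused and tested.
    static String asciiForm(String domain) {
        return IDN.toASCII(domain);
    }

    static String unicodeForm(String domain) {
        return IDN.toUnicode(domain);
    }

    public static void main(String[] args) {
        String precomposed = "b\u00FCcher.example";  // u WITH DIAERESIS as one character
        String decomposed = "bu\u0308cher.example";  // 'u' + COMBINING DIAERESIS

        String ascii = asciiForm(precomposed);
        System.out.println(ascii);  // xn--bcher-kva.example

        // Compare the ASCII form of the decomposed spelling with the first result.
        System.out.println(ascii.equals(asciiForm(decomposed)));

        // Converting back yields the Unicode form of the name.
        System.out.println(unicodeForm(ascii).equals(precomposed));  // true
    }
}
```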

The normalization code is the same as that shown in the NFC example, except that the Form.NFKC normalization form is used instead of Form.NFC.

Example: NFKD

This form of normalization is useful when legacy text data is converted to XML format. The Unicode in XML and other Markup Languages specification defines several rules for dealing with compatibility characters. For example, it recommends using <sup> and <sub> markup for superscripts and subscripts, using MathML markup for expressing fractions, using list item marker styles instead of circled digits, and so on. If you are building an application that converts legacy data to XML, you should consider applying the appropriate markup and/or styles to text data that has been normalized to NFKD.

In order to convert data to NFKD form, you should pass Form.NFKD as the second parameter to the Normalizer.normalize method:
Normalizer.normalize(s, Form.NFKD);
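One way to combine NFKD with markup, sketched here under simplifying assumptions, is to detect superscript and subscript digits, replace each with its NFKD form, and wrap the result in <sup> or <sub>. The class ScriptMarkup and its methods are hypothetical, and the hard-coded character ranges cover only digits, not every compatibility character the specification discusses.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class ScriptMarkup {
    // Superscript digits are scattered: U+00B9, U+00B2, U+00B3, U+2070, U+2074..U+2079.
    static boolean isSuperscriptDigit(char c) {
        return c == '\u00B9' || c == '\u00B2' || c == '\u00B3'
                || c == '\u2070' || (c >= '\u2074' && c <= '\u2079');
    }

    // Subscript digits are contiguous: U+2080..U+2089.
    static boolean isSubscriptDigit(char c) {
        return c >= '\u2080' && c <= '\u2089';
    }

    // Returns the NFKD form of a single character (the plain digit here).
    static String plain(char c) {
        return Normalizer.normalize(String.valueOf(c), Form.NFKD);
    }

    // Hypothetical converter: wraps normalized digits in <sup> or <sub> markup
    // and copies every other character through unchanged.
    static String toMarkup(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isSuperscriptDigit(c)) {
                out.append("<sup>").append(plain(c)).append("</sup>");
            } else if (isSubscriptDigit(c)) {
                out.append("<sub>").append(plain(c)).append("</sub>");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // x SUPERSCRIPT TWO, then H SUBSCRIPT TWO O
        System.out.println(toMarkup("x\u00B2 + H\u2082O"));  // x<sup>2</sup> + H<sub>2</sub>O
    }
}
```

Normalizing the whole string first would lose the information about which digits were superscripts, which is why the sketch normalizes one character at a time after classifying it.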

Normalization Testing

The java.text.Normalizer class defines the isNormalized method, which checks whether a given character sequence is normalized according to one of the four normalization forms. The following code reads lines from standard input and reports, for each line, which of the four forms the line is already normalized to. Input is UTF-8 encoded.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class IsNormalized {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final Form[] forms = { Form.NFC, Form.NFD, Form.NFKC, Form.NFKD };
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            String s;
            int line = 1;
            while ((s = r.readLine()) != null) {
                System.out.printf("%5d:", line++);
                for (Form f : forms) {
                    if (Normalizer.isNormalized(s, f)) {
                        System.out.print(" " + f.toString());
                    }
                }
                System.out.println();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
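A common use of isNormalized is to skip the normalization call when the text is already in the desired form. The sketch below shows that pattern on fixed strings; the class NormalizeIfNeeded and the method toNfc are hypothetical names, and whether the check saves any work in practice depends on the data.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NormalizeIfNeeded {
    // Returns the input unchanged when it is already in NFC;
    // otherwise returns a normalized copy.
    static String toNfc(String s) {
        return Normalizer.isNormalized(s, Form.NFC)
                ? s
                : Normalizer.normalize(s, Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C7";  // LATIN CAPITAL LETTER C WITH CEDILLA
        String combining = "C\u0327";   // 'C' + COMBINING CEDILLA

        System.out.println(Normalizer.isNormalized(precomposed, Form.NFC));  // true
        System.out.println(Normalizer.isNormalized(precomposed, Form.NFD));  // false
        System.out.println(Normalizer.isNormalized(combining, Form.NFD));    // true

        // The combining sequence is converted; the precomposed string
        // is handed back as the same object.
        System.out.println(toNfc(combining).equals(precomposed));  // true
        System.out.println(toNfc(precomposed) == precomposed);     // true
    }
}
```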