|
|
|
/** |
|
******************************************************************************* |
|
* Copyright (C) 1996-2014, International Business Machines Corporation and |
|
* others. All Rights Reserved. |
|
******************************************************************************* |
|
*/ |
|
|
|
package sun.text.normalizer; |
|
|
|
/** |
|
* <p>The UCharacter class provides extensions to the |
|
* <a href="http://java.sun.com/j2se/1.5/docs/api/java/lang/Character.html"> |
|
* java.lang.Character</a> class. These extensions provide support for |
|
* more Unicode properties and, together with the <a href="../text/UTF16.html">UTF16</a>
|
* class, provide support for supplementary characters (those with code |
|
* points above U+FFFF). |
|
* Each ICU release supports the latest version of Unicode available at that time. |
|
* |
|
* <p>Code points are represented in these APIs using ints. While it would be
|
* more convenient in Java to have a separate primitive datatype for them, |
|
* ints suffice in the meantime. |
|
* |
|
* <p>To use this class, add the jar file icu4j.jar to the
* class path, since it contains the data files that supply the information used
* by this class.<br>
* For example, on Windows:<br>
|
* <code>set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar</code>.<br> |
|
* Otherwise, another method would be to copy the files uprops.dat and |
|
* unames.icu from the icu4j source subdirectory |
|
* <i>$ICU4J_SRC/src/com.ibm.icu.impl.data</i> to your class directory |
|
* <i>$ICU4J_CLASS/com.ibm.icu.impl.data</i>. |
|
* |
|
* <p>Aside from the additions for UTF-16 support, and the updated Unicode |
|
* properties, the main differences between UCharacter and Character are: |
|
* <ul> |
|
* <li> UCharacter is not designed to be a char wrapper and does not have
* APIs that manage a single wrapped char.<br>
|
* These include: |
|
* <ul> |
|
* <li> char charValue(), |
|
* <li> int compareTo(java.lang.Character, java.lang.Character), etc. |
|
* </ul> |
|
* <li> UCharacter does not include Character APIs that are deprecated, nor |
|
* does it include the Java-specific character information, such as |
|
* boolean isJavaIdentifierPart(char ch). |
|
* <li> Character maps characters 'A' - 'Z' and 'a' - 'z' to the numeric |
|
* values '10' - '35'. UCharacter also does this in digit and |
|
* getNumericValue, to adhere to the Java semantics of these
|
* methods. New methods unicodeDigit, and |
|
* getUnicodeNumericValue do not treat the above code points |
|
* as having numeric values. This is a semantic change from ICU4J 1.3.1. |
|
* </ul> |
|
* <p> |
|
* Further detail on differences can be determined using the program |
|
* <a href= |
|
* "http://source.icu-project.org/repos/icu/icu4j/trunk/src/com/ibm/icu/dev/test/lang/UCharacterCompare.java"> |
|
* com.ibm.icu.dev.test.lang.UCharacterCompare</a> |
|
* </p> |
|
* <p> |
|
* In addition to Java compatibility functions, which calculate derived properties, |
|
* this API provides low-level access to the Unicode Character Database. |
|
* </p> |
|
* <p> |
|
* Unicode assigns each code point (not just assigned characters) values for
|
* many properties. |
|
* Most of them are simple boolean flags, or constants from a small enumerated list. |
|
* For some properties, values are strings or other relatively more complex types. |
|
* </p> |
|
* <p> |
|
* For more information see |
|
* <a href="http://www.unicode.org/ucd/">"About the Unicode Character Database"</a>
|
* (http://www.unicode.org/ucd/) |
|
* and the <a href="http://www.icu-project.org/userguide/properties.html">ICU |
|
* User Guide chapter on Properties</a> |
|
* (http://www.icu-project.org/userguide/properties.html). |
|
* </p> |
|
* <p> |
|
* There are also functions that provide easy migration from C/POSIX functions |
|
* like isblank(). Their use is generally discouraged because the C/POSIX |
|
* standards do not define their semantics beyond the ASCII range, which means |
|
* that different implementations exhibit very different behavior. |
|
* Instead, Unicode properties should be used directly. |
|
* </p> |
|
* <p> |
|
* There are also only a few, broad C/POSIX character classes, and they tend |
|
* to be used for conflicting purposes. For example, the "isalpha()" class |
|
* is sometimes used to determine word boundaries, while a more sophisticated |
|
* approach would at least distinguish initial letters from continuation |
|
* characters (the latter including combining marks). |
|
* (In ICU, BreakIterator is the most sophisticated API for word boundaries.) |
|
* Another example: There is no "istitle()" class for titlecase characters. |
|
* </p> |
|
* <p> |
|
* ICU 3.4 and later provide API access for all twelve C/POSIX character classes.
|
* ICU implements them according to the Standard Recommendations in |
|
* Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions |
|
* (http://www.unicode.org/reports/tr18/#Compatibility_Properties). |
|
* </p> |
|
* <p> |
|
* API access for C/POSIX character classes is as follows: |
|
* <pre>{@code |
|
* - alpha: isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC) |
|
* - lower: isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE) |
|
* - upper: isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE) |
|
* - punct: ((1<<getType(c)) & ((1<<DASH_PUNCTUATION)|(1<<START_PUNCTUATION)| |
|
* (1<<END_PUNCTUATION)|(1<<CONNECTOR_PUNCTUATION)|(1<<OTHER_PUNCTUATION)| |
|
* (1<<INITIAL_PUNCTUATION)|(1<<FINAL_PUNCTUATION)))!=0 |
|
* - digit: isDigit(c) or getType(c)==DECIMAL_DIGIT_NUMBER |
|
* - xdigit: hasBinaryProperty(c, UProperty.POSIX_XDIGIT) |
|
* - alnum: hasBinaryProperty(c, UProperty.POSIX_ALNUM) |
|
* - space: isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE) |
|
* - blank: hasBinaryProperty(c, UProperty.POSIX_BLANK) |
|
* - cntrl: getType(c)==CONTROL |
|
* - graph: hasBinaryProperty(c, UProperty.POSIX_GRAPH) |
|
* - print: hasBinaryProperty(c, UProperty.POSIX_PRINT) |
|
* }</pre> |
|
* </p> |
|
* <p> |
|
* The C/POSIX character classes are also available in UnicodeSet patterns, |
|
* using patterns like [:graph:] or \p{graph}. |
|
* </p> |
|
* |
|
* <p>
* There are several ICU (and Java) whitespace functions.
|
* Comparison:<ul> |
|
* <li> isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property; |
|
* most of general categories "Z" (separators) + most whitespace ISO controls |
|
* (including no-break spaces, but excluding IS1..IS4 and ZWSP) |
|
* <li> isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces |
|
* <li> isSpaceChar: just Z (including no-break spaces)</ul> |
|
* </p> |
|
* <p> |
|
* This class is not subclassable. |
|
* </p> |
|
* @author Syn Wee Quek |
|
* @stable ICU 2.1 |
|
* @see com.ibm.icu.lang.UCharacterEnums |
|
*/ |
|
|
|
public final class UCharacter |
|
{ |
|
|
public static interface JoiningGroup |
|
{ |
|
|
public static final int NO_JOINING_GROUP = 0; |
|
} |
|
|
public static interface NumericType |
|
{ |
|
|
public static final int NONE = 0; |
|
|
public static final int DECIMAL = 1; |
|
|
public static final int DIGIT = 2; |
|
|
public static final int NUMERIC = 3; |
|
|
public static final int COUNT = 4; |
|
} |
|
|
public static interface HangulSyllableType |
|
{ |
|
|
public static final int NOT_APPLICABLE = 0; /*[NA]*/ /*See note !!*/ |
|
|
public static final int LEADING_JAMO = 1; /*[L]*/ |
|
|
public static final int VOWEL_JAMO = 2; /*[V]*/ |
|
|
public static final int TRAILING_JAMO = 3; /*[T]*/ |
|
|
public static final int LV_SYLLABLE = 4; /*[LV]*/ |
|
|
public static final int LVT_SYLLABLE = 5; /*[LVT]*/ |
|
|
public static final int COUNT = 6; |
|
} |
|
|
|
// public data members ----------------------------------------------- |
|
|
public static final int MIN_VALUE = UTF16.CODEPOINT_MIN_VALUE; |
|
|
public static final int MAX_VALUE = UTF16.CODEPOINT_MAX_VALUE; |
|
|
|
// public methods ---------------------------------------------------- |
|
|
public static int digit(int ch, int radix) |
|
{ |
|
if (2 <= radix && radix <= 36) { |
|
int value = digit(ch); |
|
if (value < 0) { |
|
|
|
value = UCharacterProperty.getEuropeanDigit(ch); |
|
} |
|
return (value < radix) ? value : -1; |
|
} else { |
|
return -1; |
|
} |
|
} |
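// A minimal, self-contained sketch of the radix check performed by digit(int, int) above.
// It is not the ICU implementation: java.lang.Character.digit(ch, 10) stands in for the
// Unicode decimal-digit lookup, and europeanDigit is a hypothetical ASCII-only helper
// assumed to mirror UCharacterProperty.getEuropeanDigit for 'A'-'Z' and 'a'-'z'.

```java
final class RadixDigitSketch {
    private RadixDigitSketch() { }

    /** Hypothetical helper: maps ASCII letters to 10..35, anything else to -1. */
    static int europeanDigit(int ch) {
        if ('a' <= ch && ch <= 'z') {
            return ch - 'a' + 10;
        }
        if ('A' <= ch && ch <= 'Z') {
            return ch - 'A' + 10;
        }
        return -1;
    }

    /** Returns the value of ch in the given radix, or -1 if it is not a digit there. */
    static int digit(int ch, int radix) {
        if (radix < 2 || radix > 36) {
            return -1; // unsupported radix
        }
        int value = Character.digit(ch, 10); // decimal digit value, or -1
        if (value < 0) {
            value = europeanDigit(ch); // fall back to letters as digits 10..35
        }
        // A value of -1 is always below the radix, so it passes through unchanged.
        return (value < radix) ? value : -1;
    }

    public static void main(String[] args) {
        System.out.println(digit('7', 10));
        System.out.println(digit('f', 16));
        System.out.println(digit('g', 16));
    }
}
```

// The single comparison against the radix covers both failure cases: a character with no
// digit value at all, and a digit that is too large for the requested radix.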
|
|
public static int digit(int ch) |
|
{ |
|
return UCharacterProperty.INSTANCE.digit(ch); |
|
} |
|
|
public static int getType(int ch) |
|
{ |
|
return UCharacterProperty.INSTANCE.getType(ch); |
|
} |
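// The class comment expresses the C/POSIX "punct" class as a bitmask test over the general
// category returned by getType. The sketch below shows the same technique with
// java.lang.Character as a stand-in, so it runs without ICU data. The class and method
// names are hypothetical, and the Java constants INITIAL_QUOTE_PUNCTUATION and
// FINAL_QUOTE_PUNCTUATION are assumed to correspond to ICU's INITIAL_PUNCTUATION and
// FINAL_PUNCTUATION.

```java
final class PunctMaskSketch {
    private PunctMaskSketch() { }

    // One bit per general category that counts as punctuation.
    static final int PUNCT_MASK =
            (1 << Character.DASH_PUNCTUATION)
          | (1 << Character.START_PUNCTUATION)
          | (1 << Character.END_PUNCTUATION)
          | (1 << Character.CONNECTOR_PUNCTUATION)
          | (1 << Character.OTHER_PUNCTUATION)
          | (1 << Character.INITIAL_QUOTE_PUNCTUATION)
          | (1 << Character.FINAL_QUOTE_PUNCTUATION);

    /** Tests whether the general category of c is one of the punctuation categories. */
    static boolean isPunct(int c) {
        return ((1 << Character.getType(c)) & PUNCT_MASK) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isPunct('!'));
        System.out.println(isPunct('a'));
    }
}
```

// Because every category value is below 32, a single int mask can hold one bit per
// category, and the membership test is one shift and one AND.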
|
|
public static int getDirection(int ch) |
|
{ |
|
return UBiDiProps.INSTANCE.getClass(ch); |
|
} |
|
|
public static int getMirror(int ch) |
|
{ |
|
return UBiDiProps.INSTANCE.getMirror(ch); |
|
} |
|
|
public static int getBidiPairedBracket(int c) { |
|
return UBiDiProps.INSTANCE.getPairedBracket(c); |
|
} |
|
|
public static int getCombiningClass(int ch) |
|
{ |
|
return Normalizer2.getNFDInstance().getCombiningClass(ch); |
|
} |
|
|
public static VersionInfo getUnicodeVersion() |
|
{ |
|
return UCharacterProperty.INSTANCE.m_unicodeVersion_; |
|
} |
|
|
public static int getCodePoint(char lead, char trail) |
|
{ |
|
if (UTF16.isLeadSurrogate(lead) && UTF16.isTrailSurrogate(trail)) { |
|
return UCharacterProperty.getRawSupplementary(lead, trail); |
|
} |
|
throw new IllegalArgumentException("Illegal surrogate characters"); |
|
} |
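// A minimal sketch of what getCodePoint(char, char) above computes, written against
// java.lang.Character so it is self-contained. The class name is hypothetical, and the
// arithmetic is the standard UTF-16 decoding formula, assumed here to match what
// UCharacterProperty.getRawSupplementary returns for a valid pair.

```java
final class SurrogatePairSketch {
    private SurrogatePairSketch() { }

    /** Combines a lead (high) and trail (low) surrogate into a supplementary code point. */
    static int codePoint(char lead, char trail) {
        if (Character.isHighSurrogate(lead) && Character.isLowSurrogate(trail)) {
            // 10 payload bits from each surrogate, offset past the BMP.
            return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
        }
        throw new IllegalArgumentException("Illegal surrogate characters");
    }

    public static void main(String[] args) {
        int cp = codePoint((char) 0xD83D, (char) 0xDE00);
        System.out.println(Integer.toHexString(cp));
    }
}
```

// Validating both halves before combining keeps unpaired or swapped surrogates from
// silently producing a wrong code point.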
|
|
public static VersionInfo getAge(int ch) |
|
{ |
|
if (ch < MIN_VALUE || ch > MAX_VALUE) { |
|
throw new IllegalArgumentException("Codepoint out of bounds"); |
|
} |
|
return UCharacterProperty.INSTANCE.getAge(ch); |
|
} |
|
|
|
/** |
|
* <p>Returns the property value for a Unicode property type of a code point.
* Also returns binary and mask property values.</p>
|
* <p>Unicode, especially in version 3.2, defines many more properties than |
|
* the original set in UnicodeData.txt.</p> |
|
* <p>The properties APIs are intended to reflect Unicode properties as |
|
* defined in the Unicode Character Database (UCD) and Unicode Technical |
|
* Reports (UTR). For details about the properties see |
|
* http://www.unicode.org/.</p> |
|
* <p>For names of Unicode properties see the UCD file PropertyAliases.txt. |
|
* </p> |
|
* <pre> |
|
* Sample usage: |
|
* int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH); |
|
* int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC); |
|
* boolean b = (ideo == 1);
|
* </pre> |
|
* @param ch code point to test. |
|
* @param type UProperty selector constant, identifies which
* property to check. Must be
* UProperty.BINARY_START &lt;= type &lt; UProperty.BINARY_LIMIT or
* UProperty.INT_START &lt;= type &lt; UProperty.INT_LIMIT or
* UProperty.MASK_START &lt;= type &lt; UProperty.MASK_LIMIT.
|
* @return numeric value that is directly the property value or, |
|
* for enumerated properties, corresponds to the numeric value of |
|
* the enumerated constant of the respective property value |
|
* enumeration type (cast to enum type if necessary). |
|
* Returns 0 or 1 (for false / true) for binary Unicode properties. |
|
* Returns a bit-mask for mask properties. |
|
* Returns 0 if 'type' is out of bounds or if the Unicode version |
|
* does not have data for the property at all, or not for this code |
|
* point. |
|
* @see UProperty |
|
* @see #hasBinaryProperty |
|
* @see #getIntPropertyMinValue |
|
* @see #getIntPropertyMaxValue |
|
* @see #getUnicodeVersion |
|
* @stable ICU 2.4 |
|
*/ |
|
|
|
public static int getIntPropertyValue(int ch, int type) { |
|
return UCharacterProperty.INSTANCE.getIntPropertyValue(ch, type); |
|
} |
|
|
|
// private constructor ----------------------------------------------- |
|
|
private UCharacter() { } |
|
|
|
/* |
|
* Copied from UCharacterEnums.java |
|
*/ |
|
|
public static final byte NON_SPACING_MARK = 6; |
|
|
public static final byte ENCLOSING_MARK = 7; |
|
|
public static final byte COMBINING_SPACING_MARK = 8; |
|
|
public static final byte CHAR_CATEGORY_COUNT = 30; |
|
|
public static final int RIGHT_TO_LEFT = 1; |
|
|
public static final int RIGHT_TO_LEFT_ARABIC = 13; |
|
} |
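// The class comment compares several whitespace predicates. The sketch below illustrates
// the two Java ones it mentions, isWhitespace and isSpaceChar, using java.lang.Character
// only; isUWhiteSpace is ICU-specific and is not shown. The class and method names are
// hypothetical.

```java
final class WhitespaceSketch {
    private WhitespaceSketch() { }

    /** Formats how the two Java predicates classify a code point. */
    static String describe(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x0009, 0x00A0, 0x2028};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

// TAB and NO-BREAK SPACE are the instructive cases: TAB is a control, so only isWhitespace
// accepts it, while the no-break space is a separator that isWhitespace deliberately
// excludes and isSpaceChar accepts.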