Distribution of Unicode Code Point Ranges
2012-03-19. Compiled from: http://topic.csdn.net/u/20080629/00/2f669f44-6e30-4e2e-9cce-08889dba2ae2.html
------------------------------------------------------------------------
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek
0400..04FF; Cyrillic
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
2800..28FF; Braille Patterns
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility  // start of Chinese characters
3400..4DB5; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs  // end of Chinese characters
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7A3; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFD; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6D6; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
---------------------------------------------------------------------------
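Unicode CJK (Chinese) characters are spread across several ranges; the blocks whose names contain "CJK" are where Han characters live. The most commonly used range is U+4E00~U+9FA5, inside the block named CJK Unified Ideographs. At the time of writing, U+9FA6~U+9FFF were still unassigned,
but there is no guarantee they stay that way (later Unicode versions have in fact assigned code points in this range).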
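The distinction above between "inside the block" and "actually assigned" can be checked directly in Java. The following is only a minimal sketch: the class and method names are hypothetical, and the result of `Character.isDefined` for code points near the end of the block depends on the Unicode version shipped with the running JDK.

```java
class CjkBlockCheck {
    /** True if the code point lies in the CJK Unified Ideographs block (U+4E00..U+9FFF). */
    static boolean inCjkUnifiedBlock(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** True if the code point is in the block AND assigned in this JDK's Unicode version. */
    static boolean isDefinedCjkUnified(int codePoint) {
        return inCjkUnifiedBlock(codePoint) && Character.isDefined(codePoint);
    }

    public static void main(String[] args) {
        // Sample code points: two inside the classic range, one just past it,
        // one Hiragana letter, and one ASCII letter.
        int[] samples = {0x4E2D, 0x9FA5, 0x9FA6, 0x3042, 0x41};
        for (int cp : samples) {
            System.out.printf("U+%04X inBlock=%b defined=%b%n",
                    cp, inCjkUnifiedBlock(cp), isDefinedCjkUnified(cp));
        }
    }
}
```

Block membership is decided by range alone, so U+9FA6 reports as inside the block on any JDK; whether it is also defined is the part that changes between versions.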
PS: the Unicode code chart for U+4E00~U+9FFF:
http://www.unicode.org/charts/PDF/U4E00.pdf
Any character can be looked up here by its Unicode code point:
http://www.unicode.org/cgi-bin/GetUnihanData.pl
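Also: using [\u4e00-\u9fa5] in a regular expression is hard-coded; it cannot adapt to
the character set range that the platform actually provides. For undemanding uses that is good enough.
If you need to be strict about the character set, match by Unicode block instead.
In the JDK version current at the time of writing, this means the same thing as [\u4e00-\u9fa5]. However, it matches
whatever characters the Java platform defines in the Unicode block named CJK Unified Ideographs, so it is "live" code:
when a later JDK defines characters from \u9fa6 onward, the same pattern will match them as well.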
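To make the comparison concrete, here is a small Java sketch of both styles. The block-based expression is an assumption about what such a pattern could look like, built from `java.util.regex` block syntax (`\p{In...}`) intersected with "not unassigned" (`\P{Cn}`); it is not necessarily the exact expression the original post used, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CjkRegexSketch {
    // Hard-coded range: matches only U+4E00..U+9FA5, whatever the JDK version.
    static final Pattern FIXED_RANGE = Pattern.compile("[\\u4e00-\\u9fa5]+");

    // Block-based: code points in the CJK Unified Ideographs block,
    // excluding those that are unassigned (general category Cn) in this JDK.
    static final Pattern BLOCK_BASED =
            Pattern.compile("[\\p{InCJKUnifiedIdeographs}&&\\P{Cn}]+");

    static boolean matchesFixed(String s) {
        return FIXED_RANGE.matcher(s).matches();
    }

    static boolean matchesBlock(String s) {
        return BLOCK_BASED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String han = "\u4e2d\u6587";   // two common Han characters
        String kana = "\u3042";        // a Hiragana letter, outside the block
        System.out.println(matchesFixed(han) + " " + matchesBlock(han));
        System.out.println(matchesFixed(kana) + " " + matchesBlock(kana));
    }
}
```

For text within U+4E00~U+9FA5 the two patterns agree; they can differ only for code points past U+9FA5, where the block-based pattern follows whatever the running JDK defines.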