Posted on:
Last modified:
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
from wikipedia
In the Unicode standard, a plane is a continuous group of 65536 code points. There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–10 hexadecimal of the first two positions in six position format (hhhhhh).
Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes 1 through 16 are called "supplementary planes", As of Unicode version 9.0, six of the planes have assigned code points (characters), and four are named.
The limit of 17 (which is not a power of 2) is due to the design of UTF-16, which can encode a maximum value of 0x10FFFF,[2] the last code point in plane 16. The encoding scheme used by UTF-8 was designed with a much larger limit of 231 code points (32,768 planes), and can encode 221 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.
The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
Planes are further subdivided into Unicode blocks, which, unlike planes, do not have a fixed size. The 273 blocks defined in Unicode 9.0 cover 24% of the possible code point space, and range in size from a minimum of 16 code points (twelve blocks) to a maximum of 65,536 code points (Supplementary Private Use Area-A and -B, which constitute the entirety of planes 15 and 16). For future usage, ranges of characters have been tentatively mapped out for most known current and ancient writing systems.
0000-007F ASCII
0080-1FFF 各种鸟语
2000-206F 常用标点 General Punctuation 包含双引号
2070-209F 上下标
20A0-20CF 货币符号
20D0-214F 各种物理符号
2150-218f 罗马数字
2190-21ff 箭头 https://en.wikipedia.org/wiki/Arrows_(Unicode_block) 有 emoji
2220-22FF 数学符号
2300-23ff 符号 Miscellaneous Technical 有 emoji
2400-245f 符号
2460-24ff 圆圈 Enclosed Alphanumerics
2500-257f 画方块字符 Box Drawing 好多人用做竖线
2580-259f 方块字符 Box Elements
25A0-2e7f 各种奇怪的字符,包含部分表情 emoji
3000-303f 中文符号和标点,包含了中括号等,竟然有双字节宽的引号
3040-33ff 各种中文符号,3190 也是竖线
4DC0-4DFF 八卦符号
4e00-9fff CJK 统一表意文字
A000-D7af 各种鸟语
D7B0-D7FF UTF-16 高半区,实际使用 D800-DBFF
DC00-DFFF 低半区
E000-F8FF 私用区,其中 F8FF 表示
F900-FAFF CJK 统一表意文字
FB00-FE0F 各种鸟语
FE10-FE1F 竖排符号
FE30-FE4F CJK 兼容标点
FE50-FE6F 小标点
FE70-FEFF 鸟语
FF00-FFEF 全角与半角
FFF0-FFFF 奇葩
需要注意的区域
1F000-1FFFF 各种表情 emoji
Emoji 有 text 和 emoji-style 两种形式,每个 emoji 有一个默认的形式,可以添加字符来强制指定形式:\ufe0e
\ufe0f
Emoji 可以表示不同的肤色,\u1F3FB–\u1F3FF
组合,Emoji 还可以组合成新的 emoji,这样来拟补不足,使用\u200d
http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html
更多 emoji 表情参见:http://www.unicode.org/Public/emoji/1.0/emoji-data.txt
Surrogate to Non-Surrogate:
N = 0x10000 + (H - 0xd800) * 0x400 + (L - 0xDC00)
Non-Surrogate to Surrogate
H = (N - 0x10000) / 0x400 + 0xD800 L = (N - 0x10000) % 0x400 + 0xdc00
Python
narrow build python does not support SMP, python on mac are all narrow build.if you got narrow build, you will have to use to unicode char to represent.
常见问题
菊花文
https://zh.wikipedia.org/wiki/%E8%8F%8A%E8%8A%B1%E6%96%87
Unicode character orders
Left-to-Right Mark/Right-to-Left Mark
not very useful, fix puncutation positions
https://en.wikipedia.org/wiki/Right-to-left_mark
Left-to-right Order/Right-to-Left Order
This is very powerful, override normal character directions
U+202d LEFT-TO-RIGHT OVERRIDE The following text will be left-to-right. Additionally, the directionality of characters is changed to left-to-right. Used alone in an English text, this will only affect characters that are right-to-left by default, like Arabic letters. U+202e RIGHT-TO-LEFT OVERRIDE The following text will be right-to-left. Additionally, the directionality of characters is changed to right-to-left. Use this character to completely screw up an English text.
see https://www.explainxkcd.com/wiki/index.php/1137:_RTL https://www.zhihu.com/question/43621727/answer/96178474
white spaces
https://en.wikipedia.org/wiki/Whitespace_character
NFKC 和 NFKD 会把一些为了兼容性提供的字符转化成标准字符,比如罗马数字 I(1)
其实就是英文字母 I
。不过蛋疼的是,大部分中文标点也会被转化为英文标点,但是中文句号又不会被转化成英文句号。
In [13]: unicodedata.normalize("NFKC", "。")
Out[13]: '。'
In [14]: unicodedata.normalize("NFKC", "!")
Out[14]: '!'
In [15]: unicodedata.normalize("NFKC", "?")
Out[15]: '?'
© 2016-2022 Yifei Kong. Powered by ynotes
All contents are under the CC-BY-NC-SA license, if not otherwise specified.
Opinions expressed here are solely my own and do not express the views or opinions of my employer.
友情链接: MySQL 教程站