Unicode Character Set and UTF-8, UTF-16, UTF-32 Encoding
In a computer's memory, text is handled uniformly as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8.
When you edit a file in Notepad, the UTF-8 characters read from the file are converted to Unicode in memory; when you finish editing and save, the Unicode text is converted back to UTF-8 and written to the file.
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser, which is why the source of many pages contains something like <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.
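As a minimal sketch of that round trip (the file name demo.txt is just an illustration, not from the original text), a Python 3 str holds Unicode code points in memory, and the conversion to and from UTF-8 happens only at the file or network boundary:

```python
# Hypothetical demo: Unicode text in memory, UTF-8 bytes on disk.
text = '统一使用 Unicode，保存为 UTF-8'

with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write(text)                      # Unicode in memory -> UTF-8 bytes on disk

with open('demo.txt', 'r', encoding='utf-8') as f:
    assert f.read() == text           # UTF-8 bytes on disk -> back to Unicode
```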
Reposted from: https://naveenr.net/unicode-character-set-and-utf-8-utf-16-utf-32-encoding/
ASCII
In the older days of computing, ASCII code was used to represent characters. The English language has only 26 letters, plus a few special characters and symbols.
The standard ASCII table lists each of these characters along with its decimal and hexadecimal value.
ASCII values range from 0 to 127 in the decimal number system. Let's look at the binary representation of 0 and 127 in an 8-bit byte.
0 is represented as 00000000
127 is represented as 01111111
As these binary representations show, decimal values 0 to 127 can be represented using only 7 bits, leaving the 8th bit free.
This is where things started getting messy.
People came up with different ways of using the remaining eighth bit, which represented decimal values from 128 to 255, and collisions started to happen. For instance, the decimal value 182 was used by the Vietnamese to represent the Vietnamese letter ờ, whereas the same value 182 was used by the Indians to represent the Hindi letter घ. So if an email written by an Indian contained the letter घ and was read by a person in Vietnam, it would appear as ờ. Clearly not the intended way to appear.
This is where the Unicode character set came to save the day.
Unicode and Code Points
The Unicode character set maps each character in the world to a unique number. This ensures that there are no collisions between the alphabets of different languages. These numbers are platform independent.
These unique numbers are called code points in Unicode terminology.
Let's see how they are referenced.
The Latin character ṍ is referred to using the code point
U+1E4D
U+ denotes Unicode, and 1E4D is the hexadecimal value assigned to the character ṍ.
The English letter A is represented as U+0041.
Visit http://www.unicode.org/charts/ to look up the code points for all the languages and scripts of the world.
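As a quick illustration (not part of the original article), Python's ord() and chr() expose these code points directly, and formatting with U+{:04X} reproduces the notation used above:

```python
# Print the code point of a few characters in U+XXXX notation.
for ch in ('A', 'ñ', 'ṍ'):
    print(f'{ch} -> U+{ord(ch):04X}')
# A -> U+0041
# ñ -> U+00F1
# ṍ -> U+1E4D

# And back from a code point to the character.
assert chr(0x1E4D) == 'ṍ'
```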
UTF-8 Encoding
Now that we know what Unicode is and how each character in the world is assigned a unique code point, we need a way to represent these code points in the computer's memory. This is where character encodings come into the picture. One such encoding scheme is UTF-8.
UTF-8 is a variable-width encoding scheme for representing Unicode code points in memory. Variable-width means a code point is represented using 1, 2, 3 or 4 bytes depending on its size.
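To make the byte layouts in the following sections concrete, here is a hand-rolled sketch of a UTF-8 encoder for a single code point (purely illustrative; it skips surrogate and range validation, and in practice you would just call str.encode('utf-8')):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (illustrative sketch)."""
    if code_point < 0x80:                    # 1 byte:  0xxxxxxx
        return bytes([code_point])
    elif code_point < 0x800:                 # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    elif code_point < 0x10000:               # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    else:                                    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])

# Sanity check against Python's built-in encoder.
for ch in ('A', 'ñ', 'ṍ'):
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
```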
UTF-8 1 byte encoding
A 1 byte encoding is identified by the presence of a 0 in the first bit.
The English letter A has the Unicode code point U+0041. Its binary representation is 1000001.
A is represented in UTF-8 encoding as
01000001
The leading 0 bit indicates that 1 byte encoding is used, and the remaining 7 bits represent the code point.
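You can check this in Python (a small verification, not from the original article):

```python
# 'A' (U+0041) fits in a single UTF-8 byte; the leading bit is 0.
print([format(b, '08b') for b in 'A'.encode('utf-8')])   # ['01000001']
```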
UTF-8 2 byte encoding
The Latin letter ñ, with code point U+00F1, has the binary value 11110001. This value is larger than the maximum that can be represented in the 1 byte encoding format, so this character is represented using UTF-8 2 byte encoding.
A 2 byte encoding is identified by the presence of the bit sequence 110 at the start of the first byte and 10 at the start of the second byte.
The binary value of the Unicode code point U+00F1 is 1111 0001. Filling these bits into the 2 byte encoding format gives the UTF-8 2 byte representation of ñ shown below. The filling is done starting with the least significant bit of the code point mapped to the least significant bit of the second byte.
11000011 10110001
The digits 11110001, spread across the low bits of the two bytes, are the binary value of the code point U+00F1; the prefixes 110 and 10 are the 2 byte encoding identifiers, and the remaining bits are filled with zeros.
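The same two-byte pattern can be verified in Python:

```python
# 'ñ' (U+00F1) takes two UTF-8 bytes with the 110 / 10 prefixes.
print([format(b, '08b') for b in 'ñ'.encode('utf-8')])   # ['11000011', '10110001']
```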
UTF-8 3 byte encoding
The Latin character ṍ with code point U+1E4D is represented using 3 byte encoding, as it is larger than the maximum value that can be represented using 2 byte encoding.
A 3 byte encoding is identified by the presence of the bit sequence 1110 in the first byte and 10 in the second and third bytes.
The binary value of the hex code point 0x1E4D is 1111001001101. Filling these bits into the above encoding format gives the UTF-8 3 byte representation of ṍ shown below. The filling is done starting with the least significant bit of the code point mapped to the least significant bit of the third byte.
11100001 10111001 10001101
The prefixes 1110, 10 and 10 identify the 3 byte encoding, the code point's 13 bits fill the remaining positions, and the leftover bits are filled with zeros.
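Again, a quick check in Python:

```python
# 'ṍ' (U+1E4D) takes three UTF-8 bytes with the 1110 / 10 / 10 prefixes.
print([format(b, '08b') for b in 'ṍ'.encode('utf-8')])
# ['11100001', '10111001', '10001101']
```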
UTF-8 4 byte encoding
A 4 byte encoding is identified by the presence of the bit sequence 11110 in the first byte and 10 in the second, third and fourth bytes. Code points above U+FFFF, such as most emoji, need 4 bytes. For example, the winking face emoji 😉 has the code point U+1F609; filling its binary value 1 1111 0110 0000 1001 into the 4 byte format gives
11110000 10011111 10011000 10001001
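A quick check in Python (the winking face emoji is used here as an illustrative example):

```python
# U+1F609 lies above U+FFFF, so it needs four UTF-8 bytes.
encoded = '\U0001F609'.encode('utf-8')
print([format(b, '08b') for b in encoded])
# ['11110000', '10011111', '10011000', '10001001']
print(encoded.hex())   # f09f9889
```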