unicode编码与utf-8 区别

unicode编码与utf-8 区别

如果是为了跨平台兼容性，只需要知道，在 Windows 记事本的语境中：

所谓的「ANSI」指的是对应当前系统 locale 的遗留（legacy）编码。[1]
所谓的「Unicode」指的是带有 BOM 的小端序 UTF-16。[2]
所谓的「UTF-8」指的是带 BOM 的 UTF-8。[3]

举一个例子：It's 知乎日报

你看到的unicode字符集是这样的编码表：

每一个字符对应一个十六进制数字。

计算机只懂二进制，因此，严格按照unicode的方式(UCS-2)，应该这样存储：

I 00000000 01001001

t 00000000 01110100

' 00000000 00100111

s 00000000 01110011

  00000000 00100000

知 01110111 11100101

乎 01001110 01001110

日 01100101 11100101

报 01100010 10100101

这个字符串总共占用了18个字节，但是对比中英文的二进制码，可以发现，英文前9位都是0！浪费啊，浪费硬盘，浪费流量。

怎么办？

UTF。

UTF-8是这样做的：

1. 单字节的字符，字节的第一位设为0，对于英语文本，UTF-8码只占用一个字节，和ASCII码完全相同；

2. n个字节的字符(n>1)，第一个字节的前n位设为1，第n+1位设为0，后面字节的前两位都设为10，这n个字节的其余空位填充该字符unicode码，高位用0补足。

这样就形成了如下的UTF-8标记位：

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
... ...

于是，”It's 知乎日报“就变成了：

I 01001001

t 01110100

' 00100111

s 01110011

  00100000

知 11100111 10011111 10100101

乎 11100100 10111001 10001110

日 11100110 10010111 10100101

报 11100110 10001010 10100101

和上边的方案对比一下，英文短了，每个中文字符却多用了一个字节。但是整个字符串只用了17个字节，比上边的18个短了一点点。

下边是课后作业：

请将”It's 知乎日报“的GB2312和GBK码(自行google)转成二进制。不考虑历史因素，从技术角度解释为什么在unicode和UTF-8大行其道的同时，GB2312和GBK仍在广泛使用。

剧透：一切都是为了节省你的硬盘和流量。

「带 BOM 的 UTF-8」和「无 BOM 的 UTF-8」有什么区别

http://codesnipers.com/?q=node/68

With the explosion of international text resources brought by the Internet, the standards for determining file encodings have become more important. This is my attempt at making the text file encoding issues digestible by leaving out some of the unimportant anecdotal stuff. I'm also calling attention to blunders in the MSDN docs.

For Unicode files, the BOM ("Byte Order Mark" also called the signature or preamble) is a set of 2 or so bytes at the beginning used to indicate the type of Unicode encoding. The key to the BOM is that it is generally not included with the content of the file when the file's text is loaded into memory, but it may be used to affect how the file is loaded into memory. Here are the most important BOMs and the encodings they indicate:

FF FE UCS-2LE or UTF-16LE
FE FF UCS-2BE or UTF-16BE
EF BB BF UTF-8

BOM stands for Byte Order Mark which literally is meant to distinguish between little-endian LE and big-endian BE byte order by representing the code point U+FEFF ("zero width no-break space"). Since 0xFFFE (bytes swapped) is not a valid Unicode character, you can be sure of the byte order. Note that the BOM does not distinguish between UCS-2 and UTF-16 (they are the same except that UTF-16 has surrogate pairs to represent more code points). The UTF-8 BOM is not concerned with "byte order," but it is a unique enough sequence (representing the same U+FEFF code point) that would be very unlikely to be found in a non-UTF-8 file, thus distinguishing UTF-8 especially from other "ASCII-based" encodings.

The BOM is not guaranteed to be in every Unicode file. In fact the BOM is not necessary in XML files because the Unicode encoding can be determined from the leading less than sign. If the first byte is an ASCII less than sign 3C followed by a non-zero byte then the file is assumed to be UTF-8 until otherwise specified in the XML Declaration. If the XML file begins with 3C 00 it is UCS-2LE or UTF-16LE (normal for Windows Unicode files). If the XML file begins with 00 3C it is UCS-2BE or UTF-16BE. In summary then:

3C 00 UCS-2LE or UTF-16LE
00 3C UCS-2BE or UTF-16BE
3C XX UTF-8 (where XX is non-zero)
Unlike the BOM, these bytes are part of the text content of the file.

Also, the BOM is not used to determine non-Unicode encodings such as Windows-1250, GB2312, Shift-JIS, or even EBCDIC. Generally you should have some way of knowing the encoding due to the context of your situation. However, markup languages have ways of specifying the encoding in the markup near the top of the file. The following appears inside the <head> area of an HTML page:

In XML, the XML declaration specifies the encoding.

<?xml version="1.0" encoding="iso-8859-1"?>

This works because all of the encodings are compatible enough with ASCII in the lower 128, so the encoding can be specified as an ASCII string. The alternative to ASCII is EBCDIC; if an XML file begins with the EBCDIC less than sign 4C it is treated as EBCDIC and the XML declaration (if present) would be processed in standard EBCDIC to determine the extended EBCDIC variant.

If I hadn't seen fundamental problems in Microsoft documentation before it would be inconceivable to me that in articles about the Unicode BOM, the BOM is shown backwards, but it is true! Both the Byte-order Mark and IsTextUnicode MSDN articles get the UTF-16 BOM bytes backwards. Herfried K. Wagner [MVP] agreed that the documentation is wrong (plain as day but it is still good to be MVP corroborated).

Another thing that irks me is the IsTextUnicode API. This API is so useless it bears mentioning. First of all, it does not deal with UTF-8, so it can only be one part of the way Notepad detects Unicode. Raymond Chen said it is used in Notepad but Notepad actually auto-detects UTF-8 as explained below (his blog, The Old New Thing is one of my three favorites). Secondly, when text is in memory, it should usually not have the BOM with it since it can only be efficient to interrogate the BOM before loading the file. Thirdly, if you cannot guarantee the statistical analysis, you are better off using a simple deterministic mechanism that you can explain to the end user so that they don't experience seemingly erratic behavior.

UTF-8 auto-detection (when there is no UTF-8 BOM) is not widely documented (although it is exhibited by Windows Notepad). UTF-8 multi-byte sequences have a pattern that would be almost impossible to accidentally produce in another encoding. The more non-ASCII characters a UTF-8 file has, the more certain you can be that it is UTF-8, but my impression is that you don't need many.

Without going into the full algorithm, just perform UTF-8 decoding on the file looking for an invalid UTF-8 sequence. The correct UTF-8 sequences look like this:

0xxxxxxx ASCII < 0x80 (128)
110xxxxx 10xxxxxx 2-byte >= 0x80
1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x400
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
If the file is all ASCII, you cannot know if it was meant to be UTF-8 or any other character set compatible with ASCII in the lower 128.

Without contextual information, a BOM, or a file type standard with a header like XML and HTML, a file should be assumed to be in the default system locale ANSI code page, governed by the Language for non-Unicode Programs in the Regional Settings on the computer on which it is found.

// Define UTF-8 header

const char szUTF8Head[] = {(BYTE)0xEF,(BYTE)0xBB,(BYTE)0xBF,0x00};

const DWORD dwUTF8HeadLen = sizeof(szUTF8Head)/sizeof(char)-;

//  Skip an optional UTF-8 header

if (strncmp(linebuffer,szUTF8Head,dwUTF8HeadLen) == )

unicode编码与utf-8 区别的更多相关文章

java基础类型中的char和byte的辨析及Unicode编码和UTF-8的区别
在平常工作中使用到char和byte的场景不多,但是如果项目中使用到IO流操作时,则必定会涉及到这两个类型,下面让我们一起来回顾一下这两个类型吧. char和byte的对比 byte byte 字节, ...
BIG5, GB(GB2312, GBK, ...), Unicode编码, UTF8, WideChar, MultiByte, Char说明与区别
汉语unicode编译方式,BIG5是繁体规范,GB是简体规范 GB是大陆使用的国标码,BIG5码,又叫大五码,是台湾使用的繁体码. BIG5编码, GB编码(GB2312, GBK, ...), U ...
Qt中文编码和QString类Unicode编码转换
版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明. 本文链接:https://blog.csdn.net/g423tgl234/article ...
ASCII\UNICODE编码的区别
前几天,Google给我Hotmail邮箱发了封确认信.我看不懂,不是因为我英文不行,而是"???? ????? ??? ????"的内容让我不知所措.有好多程序员处理不好编码问题 ...
转载：谈谈Unicode编码，简要解释UCS、UTF、BMP、BOM等名词
转载: 谈谈Unicode编码,简要解释UCS.UTF.BMP.BOM等名词这是一篇程序员写给程序员的趣味读物.所谓趣味是指可以比较轻松地了解一些原来不清楚的概念,增进知识,类似于打RPG游戏的升级 ...
Windows 记事本的 ANSI、Unicode、UTF-8 这三种编码模式有什么区别？
[梁海的回答(99票)]: 简答.一些细节暂无精力查证,如果说错了还请指出. 一句话建议:涉及兼容性考量时,不要用记事本,用专业的文本编辑器保存为不带 BOM 的UTF-8. * * * 如果是为了跨 ...
谈谈Unicode编码，简要解释UCS、UTF、BMP、BOM等名词
这是一篇程序员写给程序员的趣味读物.所谓趣味是指可以比较轻松地了解一些原来不清楚的概念,增进知识,类似于打RPG游戏的升级.整理这篇文章的动机是两个问题: 问题一: 使用Windows记事本的“另存为 ...
ASCII编码和Unicode编码的区别
链接: 计算机只能处理数字,如果要处理文本,就必须先把文本转换为数字才能处理.Unicode把所有语言都统一到一套编码里,这样就不会再有乱码问题了.Unicode标准也在不断发展,但最常用的是用两个字 ...
Unicode编码，解释UCS、UTF、BMP、BOM等名词
(转载谈谈Unicode编码,简要解释UCS.UTF.BMP.BOM等名词这是一篇程序员写给程序员的趣味读物.所谓趣味是指可以比较轻松地了解一些原来不清楚的概念,增进知识,类似于打RPG游戏的升级 ...

随机推荐

(转)Linux grep
文章转自 http://www.cnblogs.com/ggjucheng/archive/2013/01/13/2856896.html 简介 grep (global search regular ...
Hook机制里登场的角色
稍有接触过 WordPress 主题或插件制作修改的朋友,对 WordPress 的Hook机制应该不陌生,但通常刚接触WordPress Hook 的新手,对其运作原理可能会有点混乱或模糊.本文针对 ...
泛型数组列表 ArrayList
为什么使用泛型数组列表而不使用普通数组? 1.普通数组经常会发生容量太大以致浪费的情况 2.普通数组无法动态更改数组基本概念: 1.采用[类型参数]的[类]---->[泛型类] 2.[泛型类型 ...
python成长之路【第八篇】：异常处理
一.异常基础在编程过程中为了增加友好性,在程序出现bug时一般不会将错误信息显示给用户,而是现实一个提示的页面,通俗来说就是不让用户看见大黄页!!! 语法: try: pass except Exc ...
关于HTML5你必须知道的28个新特性,新技巧以及新技术
1. 新的Doctype 尽管使用<!DOCTYPE html>,即使浏览器不懂这句话也会按照标准模式去渲染 2. Figure元素用<figure>和<figcapt ...
如何解决结果由block返回情况下的同步问题(转)
开发中经常会遇到一种简单的同步问题: 系统在获取资源时,采用了block写法,外部逻辑需要的结果是在block回调中返回的举个例子: 请求获取通讯录权限的系统弹窗调用系统方法请求通讯录权限: AB ...
vscode 与 python 的约会
安装python 官网(https://www.python.org/downloads/)下载, 安装. (简单略过). 运行python代码运行python代码的常见方式有三种: 运行pytho ...
ALT+TAB切换时小图标的添加界面透明屏幕大小竖行字体进程信息
一,ALT+TAB切换时小图标的添加 Dlg类中添加变量 protected: HICON m_hIcon; #define IDR_MAINFRAME 128 ICON IDR_MAINFRAME, ...
让Git记住用户名和密码
user/username/.gitconfig [credential] helper = store
一个靠谱的国外maven镜像地址
<mirror> <id>ui</id> <mirrorOf>central</mirrorOf> <name>Human Re ...

unicode编码与utf-8 区别

unicode编码与utf-8 区别的更多相关文章

随机推荐

热门专题