• The difference between Unicode and UTF-8

If the concern is cross-platform compatibility, all you need to know is that, in the context of Windows Notepad:

  • "ANSI" means the legacy encoding corresponding to the current system locale. [1]
  • "Unicode" means little-endian UTF-16 with a BOM. [2]
  • "UTF-8" means UTF-8 with a BOM. [3]

Take an example: It's 知乎日报

In the Unicode character set, these characters map to code points like this:

I 0049
t 0074
' 0027
s 0073
0020
知 77e5
乎 4e4e
日 65e5
报 62a5

Each character corresponds to one hexadecimal number: its code point.
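In Python (used here purely as a convenient hex calculator), the table above can be reproduced with `ord`:

```python
# Print each character of the sample string with its Unicode code point in hex
for ch in "It's 知乎日报":
    print(ch, format(ord(ch), "04x"))
# The last four lines are: 知 77e5, 乎 4e4e, 日 65e5, 报 62a5
```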

Computers only understand binary, so stored strictly the Unicode way (UCS-2, two bytes per character), the string looks like this:

I 00000000 01001001
t 00000000 01110100
' 00000000 00100111
s 00000000 01110011
00000000 00100000
知 01110111 11100101
乎 01001110 01001110
日 01100101 11100101
报 01100010 10100101

This string takes up 18 bytes in total. But compare the binary of the English and Chinese characters: for every English character, the first 9 bits are all 0! What a waste of disk space and bandwidth.
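You can verify the 18-byte figure with Python's big-endian UTF-16 codec, which for these BMP characters produces exactly the UCS-2 bytes shown above:

```python
s = "It's 知乎日报"
data = s.encode("utf-16-be")  # 2 bytes per character for BMP code points, no BOM
print(len(data))              # 18
print(data[:2].hex())         # 0049, i.e. the two bytes of 'I'
```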

What can we do about it?

UTF.

UTF-8 works like this:

1. For a single-byte character, the first bit of the byte is set to 0. For English text, UTF-8 therefore takes only one byte per character and is identical to ASCII.

2. For a character that takes n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. The remaining free bits are filled with the character's Unicode code point, padded with leading 0s.

This yields the following UTF-8 bit patterns:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
... ...
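The two rules above can be sketched directly in Python, covering the 1- to 4-byte forms actually in use (RFC 3629 caps UTF-8 at four bytes, so the 5- and 6-byte patterns are historical):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point following the rules above."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode(0x77E5).hex())  # e79fa5 — the three bytes of 知
```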

So "It's 知乎日报" becomes:

I 01001001
t 01110100
' 00100111
s 01110011
00100000
知 11100111 10011111 10100101
乎 11100100 10111001 10001110
日 11100110 10010111 10100101
报 11100110 10001010 10100101

Compare this with the scheme above: the English characters got shorter, but each Chinese character now costs one extra byte. Still, the whole string takes only 17 bytes, slightly less than the 18 above.
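Python confirms both byte counts:

```python
s = "It's 知乎日报"
# UCS-2/UTF-16 without BOM vs UTF-8
print(len(s.encode("utf-16-be")), len(s.encode("utf-8")))  # 18 17
```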

Homework:

Convert "It's 知乎日报" to binary in GB2312 and GBK (look up the code tables yourself). Then, leaving history aside, explain from a technical standpoint why GB2312 and GBK are still in wide use even as Unicode and UTF-8 dominate.

Spoiler: it's all about saving your disk space and bandwidth.
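If you want to check your homework, Python's gbk codec gives the answer away:

```python
s = "It's 知乎日报"
# ASCII stays 1 byte in GBK; each Chinese character is 2 bytes (vs 3 in UTF-8)
print(len(s.encode("gbk")))  # 13
```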

 
  • What is the difference between "UTF-8 with a BOM" and "UTF-8 without a BOM"?

http://codesnipers.com/?q=node/68

With the explosion of international text resources brought by the Internet, the standards for determining file encodings have become more important. This is my attempt at making the text file encoding issues digestible by leaving out some of the unimportant anecdotal stuff. I'm also calling attention to blunders in the MSDN docs.

For Unicode files, the BOM ("Byte Order Mark" also called the signature or preamble) is a set of 2 or so bytes at the beginning used to indicate the type of Unicode encoding. The key to the BOM is that it is generally not included with the content of the file when the file's text is loaded into memory, but it may be used to affect how the file is loaded into memory. Here are the most important BOMs and the encodings they indicate:

FF FE UCS-2LE or UTF-16LE
FE FF UCS-2BE or UTF-16BE
EF BB BF UTF-8
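Python's standard codecs module exposes these BOMs as byte constants, handy for checking a file's first bytes:

```python
import codecs

print(codecs.BOM_UTF16_LE.hex())  # fffe
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF8.hex())      # efbbbf
```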

BOM stands for Byte Order Mark which literally is meant to distinguish between little-endian LE and big-endian BE byte order by representing the code point U+FEFF ("zero width no-break space"). Since 0xFFFE (bytes swapped) is not a valid Unicode character, you can be sure of the byte order. Note that the BOM does not distinguish between UCS-2 and UTF-16 (they are the same except that UTF-16 has surrogate pairs to represent more code points). The UTF-8 BOM is not concerned with "byte order," but it is a unique enough sequence (representing the same U+FEFF code point) that would be very unlikely to be found in a non-UTF-8 file, thus distinguishing UTF-8 especially from other "ASCII-based" encodings.

The BOM is not guaranteed to be in every Unicode file. In fact the BOM is not necessary in XML files because the Unicode encoding can be determined from the leading less than sign. If the first byte is an ASCII less than sign 3C followed by a non-zero byte then the file is assumed to be UTF-8 until otherwise specified in the XML Declaration. If the XML file begins with 3C 00 it is UCS-2LE or UTF-16LE (normal for Windows Unicode files). If the XML file begins with 00 3C it is UCS-2BE or UTF-16BE. In summary then:

3C 00 UCS-2LE or UTF-16LE
00 3C UCS-2BE or UTF-16BE
3C XX UTF-8 (where XX is non-zero)
Unlike the BOM, these bytes are part of the text content of the file.
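A minimal sketch of this sniffing logic, combining the BOM table above with the XML first-byte rules (the function name is made up for illustration):

```python
def sniff_encoding(data: bytes) -> str:
    """Guess a file's Unicode encoding from its first bytes: BOM first, then '<'."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\x3c\x00"):       # '<' then 00
        return "utf-16-le"
    if data.startswith(b"\x00\x3c"):       # 00 then '<'
        return "utf-16-be"
    if data[:1] == b"\x3c" and data[1:2] != b"\x00":
        return "utf-8"                     # until the XML declaration says otherwise
    return "unknown"

print(sniff_encoding("<?xml version='1.0'?>".encode("utf-16-le")))  # utf-16-le
```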

Also, the BOM is not used to determine non-Unicode encodings such as Windows-1250, GB2312, Shift-JIS, or even EBCDIC. Generally you should have some way of knowing the encoding due to the context of your situation. However, markup languages have ways of specifying the encoding in the markup near the top of the file. The following appears inside the <head> area of an HTML page:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

In XML, the XML declaration specifies the encoding.

<?xml version="1.0" encoding="iso-8859-1"?>

This works because all of the encodings are compatible enough with ASCII in the lower 128, so the encoding can be specified as an ASCII string. The alternative to ASCII is EBCDIC; if an XML file begins with the EBCDIC less than sign 4C it is treated as EBCDIC and the XML declaration (if present) would be processed in standard EBCDIC to determine the extended EBCDIC variant.

If I hadn't seen fundamental problems in Microsoft documentation before it would be inconceivable to me that in articles about the Unicode BOM, the BOM is shown backwards, but it is true! Both the Byte-order Mark and IsTextUnicode MSDN articles get the UTF-16 BOM bytes backwards. Herfried K. Wagner [MVP] agreed that the documentation is wrong (plain as day but it is still good to be MVP corroborated).

Another thing that irks me is the IsTextUnicode API. This API is so useless it bears mentioning. First of all, it does not deal with UTF-8, so it can only be one part of the way Notepad detects Unicode. Raymond Chen said it is used in Notepad but Notepad actually auto-detects UTF-8 as explained below (his blog, The Old New Thing is one of my three favorites). Secondly, when text is in memory, it should usually not have the BOM with it since it can only be efficient to interrogate the BOM before loading the file. Thirdly, if you cannot guarantee the statistical analysis, you are better off using a simple deterministic mechanism that you can explain to the end user so that they don't experience seemingly erratic behavior.

UTF-8 auto-detection (when there is no UTF-8 BOM) is not widely documented (although it is exhibited by Windows Notepad). UTF-8 multi-byte sequences have a pattern that would be almost impossible to accidentally produce in another encoding. The more non-ASCII characters a UTF-8 file has, the more certain you can be that it is UTF-8, but my impression is that you don't need many.

Without going into the full algorithm, just perform UTF-8 decoding on the file looking for an invalid UTF-8 sequence. The correct UTF-8 sequences look like this:

0xxxxxxx ASCII < 0x80 (128)
110xxxxx 10xxxxxx 2-byte >= 0x80
1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x800
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
If the file is all ASCII, you cannot know if it was meant to be UTF-8 or any other character set compatible with ASCII in the lower 128.
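A simplified version of this check, using Python's strict UTF-8 decoder as the validator (it rejects invalid lead/continuation bytes, and also overlong forms and surrogates):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 (note: all-ASCII input also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("知乎日报".encode("utf-8")))  # True
print(looks_like_utf8("知乎日报".encode("gbk")))    # False: stray continuation byte
```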

Without contextual information, a BOM, or a file type standard with a header like XML and HTML, a file should be assumed to be in the default system locale ANSI code page, governed by the Language for non-Unicode Programs in the Regional Settings on the computer on which it is found.

// Define the UTF-8 header (BOM)
const char szUTF8Head[] = {(BYTE)0xEF, (BYTE)0xBB, (BYTE)0xBF, 0x00};
const DWORD dwUTF8HeadLen = sizeof(szUTF8Head)/sizeof(char) - 1; // 3 bytes, excluding the terminating NUL

// Skip an optional UTF-8 header at the start of the buffer
if (strncmp(linebuffer, szUTF8Head, dwUTF8HeadLen) == 0)
    linebuffer += dwUTF8HeadLen;
