What's the difference between Unicode and UTF-8?

Is it true that Unicode = UTF-16?

UPDATE

Many are saying Unicode is a standard, not an encoding, but most editors support save as Unicode encoding actually.

As Rasmus states in his article "The difference between UTF-8 and Unicode?" (link fixed):

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41 in hexadecimal (65 in decimal), usually written U+0041.
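A minimal Python sketch of the character-to-number mapping, using only the built-in `ord` and `chr`. It prints the code point for A in decimal, in hexadecimal, and in the conventional `U+` notation.

```python
# Look up the Unicode code point of a character, then map it back.
code_point = ord("A")            # character -> number
print(code_point)                # 65 (decimal)
print(hex(code_point))           # 0x41 (hexadecimal)
print(f"U+{code_point:04X}")     # U+0041 (conventional notation)

# The reverse direction: number -> character.
print(chr(0x41))                 # A
```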

An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
00000001 00000010 00000011 00000100 
Our data is now translated into binary and can now be saved to disk.
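The translation above can be reproduced with Python's built-in `str.encode`. This is only a sketch: `to_bits` is a hypothetical helper written for this example, and the euro sign is an extra illustration (not from the article) of a code point that UTF-8 stores in more than one byte.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

# Build a string from the code points 1, 2, 3, 4 and encode it as UTF-8.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")
print(to_bits(encoded))  # 00000001 00000010 00000011 00000100

# Code points above 127 need more than one byte in UTF-8.
euro = "\u20ac"  # EURO SIGN, U+20AC
euro_bits = to_bits(euro.encode("utf-8"))
print(euro_bits)  # 11100010 10000010 10101100
```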

All together now

Say an application reads the following from the disk:
01101000 01100101 01101100 01101100 01101111 
The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:
104 101 108 108 111 
Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
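The two steps, bytes to numbers and numbers to characters, can be sketched in Python as follows. The variable names are made up for this example, and each byte is written padded to eight bits.

```python
# The bytes read from disk, one 8-bit group per byte.
raw_bits = "01101000 01100101 01101100 01101100 01101111"
data = bytes(int(group, 2) for group in raw_bits.split())

# Decoding as UTF-8 yields a string; ord() exposes the code point
# behind each character, which is the intermediate list of numbers.
text = data.decode("utf-8")
code_points = [ord(ch) for ch in text]

print(code_points)  # [104, 101, 108, 108, 111]
print(text)         # hello
```

Python performs both steps inside `decode`; the list of code points is recovered afterwards only to make the intermediate result visible.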

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently give a short and precise answer:

UTF-8 and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.

answered Nov 3 '12 at 19:09 by vikas devde
edited May 2 at 15:42 by Rasmus Rønn Nielsen

@vikas...I wish I could upvote you 100 times...but thanks for explaining it very very clearly! – user547453 Dec 28 '12 at 19:04

LOVELY! Thankyou... – OceanBlue Mar 31 '13 at 1:36

Smashing indeed! – MalsR May 1 '13 at 22:56

This is totally correct, and answers the question posed in the title. It does not however answer the actual question, which is based on a misrepresentation of Microsoft using Unicode to refer to UTF-16. – Mark Ransom Feb 13 '14 at 14:07

Feel relaxed after finding this. Thanks vikas – Ramyavjr Mar 2 '14 at 14:56

most editors support save as ‘Unicode’ encoding actually.

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

(Other editors that do encodings themselves, like Notepad++, don't have this problem.)
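To see why the label matters, here is a small Python sketch that encodes the same two characters with UTF-8, UTF-16LE (what such editors call "Unicode") and UTF-16BE ("Unicode big-endian"). The sample string is an arbitrary choice for illustration.

```python
text = "h\u00e9"  # "h" (U+0068) followed by "é" (U+00E9)

# Encode the same code points with three different encodings.
encodings = {
    codec: text.encode(codec)
    for codec in ("utf-8", "utf-16-le", "utf-16-be")
}

for codec, data in encodings.items():
    # Show each byte as two hexadecimal digits.
    print(codec, " ".join(f"{byte:02x}" for byte in data))
# utf-8 68 c3 a9
# utf-16-le 68 00 e9 00
# utf-16-be 00 68 00 e9
```

The code points are identical in all three cases; only the bytes on disk differ, so "Unicode" on its own does not say which byte sequence a file contains.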

If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.

