Byte History

https://en.wikipedia.org/wiki/Byte

The term byte was coined by Werner Buchholz in July 1956, during the early design phase for the IBM Stretch^[7]^[8] computer, which had addressing to the bit and variable field length (VFL) instructions with a byte size encoded in the instruction. It is a deliberate respelling of bite to avoid accidental mutation to bit.^[1]

Early computers used a variety of four-bit binary coded decimal (BCD) representations and the six-bit codes for printable graphic patterns common in the U.S. Army (Fieldata) and Navy. These representations included alphanumeric characters and special graphical symbols. These sets were expanded in 1963 to seven bits of coding, called the American Standard Code for Information Interchange (ASCII) as the Federal Information Processing Standard, which replaced the incompatible teleprinter codes in use by different branches of the U.S. government and universities during the 1960s. ASCII included the distinction of upper- and lowercase alphabets and a set of control characters to facilitate the transmission of written language as well as printing device functions, such as page advance and line feed, and the physical or logical control of data flow over the transmission media. During the early 1960s, while also active in ASCII standardization, IBM simultaneously introduced in its product line of System/360 the eight-bit Extended Binary Coded Decimal Interchange Code (EBCDIC), an expansion of their six-bit binary-coded decimal (BCDIC) representation used in earlier card punches.^[9] The prominence of the System/360 led to the ubiquitous adoption of the eight-bit storage size, while in detail the EBCDIC and ASCII encoding schemes are different.

In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the eight-bit µ-law encoding. This large investment promised to reduce transmission costs for eight-bit data.

The development of eight-bit microprocessors in the 1970s popularized this storage size. Microprocessors such as the Intel 8008, the direct predecessor of the 8080 and the 8086, used in early personal computers, could also perform a small number of operations on the four-bit pairs in a byte, such as the decimal-add-adjust (DAA) instruction. A four-bit quantity is often called a nibble, also nybble, which is conveniently represented by a single hexadecimal digit.
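As a small illustration of the nibble idea, the Python sketch below splits a byte into its two four-bit halves and prints each half as one hexadecimal digit. The function name `split_nibbles` is made up for this example and does not come from the source.

```python
def split_nibbles(byte):
    """Return (high nibble, low nibble) for a value in the range 0..255."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value that fits in one byte")
    # The high nibble is the upper four bits, the low nibble the lower four.
    return byte >> 4, byte & 0x0F


high, low = split_nibbles(0xFA)
# Each nibble corresponds to exactly one hexadecimal digit.
print(f"{high:X} {low:X}")
```

Shifting right by four bits isolates the upper half, and masking with `0x0F` isolates the lower half, which is why a byte is conveniently written as two hex digits.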

The term octet is used to unambiguously specify a size of eight bits. It is used extensively in protocol definitions.

Historically, the term octad or octade was also used to denote eight bits, at least in Western Europe;^[6]^[5] however, this usage is no longer common today. The exact origin of the term is unclear, but it can be found in British, Dutch and German sources of the 1960s and 1970s, and throughout the documentation of Philips mainframe computers.

//4-6-7-bit
//System/360 8-bit storage size

http://www.regexlab.com/zh/encoding.htm

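1.2 Characters, bytes, and strings

The key to understanding encodings is an accurate grasp of the concepts of character and byte. The two are easily confused, so we distinguish them here: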

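Term	Description	Example
Character	A mark that people use; a symbol in the abstract sense.	'1', '中', 'a', '$', '￥', ...
Byte	The unit in which a computer stores data: an 8-bit binary number, a very concrete piece of storage.	0x01, 0x45, 0xFA, ...
ANSI string	If the characters in memory are stored in an ANSI encoding, where one character may be represented by one byte or by several bytes, the string is called an ANSI string or multibyte string.	"中文123" (7 bytes)
Unicode string	If the characters in memory are stored as their numbers in Unicode, the string is called a Unicode string or wide-character string.	L"中文123" (10 bytes)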

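Because different ANSI encodings define different standards, for a given multibyte string we must know which encoding it uses before we can know which characters it contains. A Unicode string, by contrast, always represents the same characters in any environment.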
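To make this concrete, the Python sketch below decodes the same two bytes under two different single-locale encodings. The choice of `gbk` and `latin-1` as the two codecs is an assumption made for illustration; any pair of incompatible legacy encodings would show the same effect.

```python
# The same two bytes, as stored on disk or in memory.
raw = bytes([0xD6, 0xD0])

# Interpreted with a Simplified Chinese code page.
as_gbk = raw.decode("gbk")
# Interpreted with a Western European code page.
as_latin1 = raw.decode("latin-1")

print(as_gbk, as_latin1)

# A Python str holds Unicode code points, so once decoded the
# character is identified by its number regardless of locale.
print(hex(ord(as_gbk)))
```

The bytes are identical in both cases; only the assumed encoding changes which characters come out.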

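1.3 Character sets and encodings

The ANSI encoding standards defined by individual countries and regions each specify only the characters needed for their own languages. For example, the Chinese standard (GB2312) does not specify how Korean characters are stored. What these ANSI encoding standards specify has two aspects:

Which characters are used, that is, which Chinese characters, letters, and symbols are included in the standard. The set of included characters is called the "character set".
Whether each character is stored in one byte or in several, and which bytes are used to store it. This rule is called the "encoding".

When countries and regions define an encoding standard, the set of characters and the encoding are generally defined together. So what we usually call a "character set", such as GB2312, GBK, or JIS, carries the meaning of "encoding" in addition to the meaning of "set of characters".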

The Unicode character set contains all the characters used in all languages. There are many standards for encoding the Unicode character set, for example UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.
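The difference between one character set and its several encodings can be sketched in Python as follows. The codec names `utf-8`, `utf-16-le`, and `utf-16-be` are Python's own names, and the helper `hex_bytes` is invented for this example.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


ch = "\u4e2d"  # the character '中', number U+4E2D in Unicode

# One character, one Unicode number, several byte-level encodings.
encoded = {name: hex_bytes(ch.encode(name))
           for name in ("utf-8", "utf-16-le", "utf-16-be")}

for name, shown in encoded.items():
    print(f"{name:10} {shown}")
```

The character and its number stay the same; only the bytes chosen to represent that number differ from one encoding to the next.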

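1.1 The development of characters and encodings

In terms of computer support for multiple languages, the history can be divided roughly into three stages: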

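Stage	System internal code	Description	Systems
Stage 1	ASCII	At first computers supported only English; other languages could not be stored or displayed on them.	English DOS
Stage 2	ANSI encoding (localization)	To let computers support more languages, 2 bytes in the range 0x80~0xFF are typically used to represent 1 character. For example, the Chinese character '中' is stored as the two bytes [0xD6,0xD0] on a Chinese operating system. Different countries and regions defined different standards, which produced GB2312, BIG5, JIS, and other encoding standards. These extended encodings, which use 2 bytes to represent one character, are called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, ANSI encoding means JIS. Different ANSI encodings are mutually incompatible, so when information is exchanged internationally, text belonging to two languages cannot be stored in the same piece of ANSI-encoded text.	Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98
Stage 3	Unicode (internationalization)	To make international information exchange easier, an international organization defined the Unicode character set, which assigns every character in every language a unified and unique number, to meet the needs of cross-language, cross-platform text conversion and processing.	Windows NT/2000/XP, Linux, Java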

How strings are stored in memory:

In the ASCII stage, a single-byte string stores one character per byte (SBCS). For example, "Bob123" in memory is:

42	6F	62	31	32	33	00

B	o	b	1	2	3	\0

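In the stage where ANSI encodings were used to support multiple languages, each character is represented by one or more bytes (MBCS), so characters stored this way are also called multibyte characters. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95, not counting the terminating \0: each Chinese character takes 2 bytes, and each English letter or digit takes 1 byte: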

D6	D0	CE	C4	31	32	33	00

中		文		1	2	3	\0

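After Unicode was adopted, computers store a string as the number of each character in the Unicode character set. Computers currently generally use 2 bytes (16 bits) to store one number (DBCS), so characters stored this way are also called wide characters. For example, the string "中文123" on Windows 2000 is actually stored in memory as 5 numbers: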

2D	4E	87	65	31	00	32	00	33	00	00	00	← on x86 CPUs, the low byte comes first

中		文		1		2		3		\0

10 bytes in total, not counting the two-byte terminator.
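The three memory layouts above can be reproduced with a short Python sketch. The helper name `dump` is invented for this example, and `gbk` and `utf-16-le` are assumed here as stand-ins for the Chinese Windows 95 code page and the little-endian 16-bit layout described above. Python does not append a terminating zero, so the dumps omit the trailing \0.

```python
def dump(text, encoding):
    """Encode text and format the bytes as space-separated hex pairs."""
    data = text.encode(encoding)
    return " ".join(f"{b:02X}" for b in data)


# One byte per character (SBCS).
sbcs = dump("Bob123", "ascii")
# One or two bytes per character (MBCS).
mbcs = dump("\u4e2d\u6587123", "gbk")
# Two bytes per character, low byte first.
wide = dump("\u4e2d\u6587123", "utf-16-le")

for label, shown in (("SBCS", sbcs), ("MBCS", mbcs), ("wide", wide)):
    print(f"{label}: {shown}")
```

Counting the hex pairs in each line gives the byte totals quoted in the text: 7 for the multibyte form and 10 for the wide form.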

Question:

With a leading 0 as a language-identifying prefix, wouldn't '中' no longer need 2 bytes?
