http://csharpindepth.com/Articles/General/Unicode.aspx

Scope of this page

This is a big topic. Don't expect this page to do more than scratch the surface - indeed, if you believe you're already fairly experienced and knowledgeable about character encodings and the like, this page may well not have anything new or useful for you. However, there are still many people who don't understand the difference between binary and text, or know what a character encoding is, etc. It is for these people that this page has been written. It mentions a few advanced topics, but only to make the reader aware of their existence, rather than to give much guidance on them.

Resources

See the original article for the links.

Binary and text - a big distinction

Most modern computer languages (and some older ones) make a big distinction between "binary" content and "character" (or "text") content.

The difference is largely the same as the instinctive one, but for the purposes of clarity, I'll define it here as:

  • Binary content is a sequence of octets (bytes in common parlance) with no intrinsic meaning attached. Even though there may be external means of understanding a piece of binary content to be, say, a picture, or an executable file, the content itself is just a sequence of bytes. (Note for pedantic readers: from now on, I won't use the word "octet". I'll use "byte" instead, even though strictly speaking a byte needn't be an octet. There have been architectures with 9-bit bytes, for instance. I don't believe that's a particularly relevant or useful distinction to make in this day and age, and readers are likely to be more comfortable with the word "byte".)
  • Character content is a sequence of characters.

The Unicode Glossary defines a character as:

  1. The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
  2. Synonym for abstract character. (See Definition D3 in Section 3.3, Characters and Coded Representations: http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G2212)
  3. The basic unit of encoding for the Unicode character encoding.
  4. The English name for the ideographic written elements of Chinese origin. (See ideograph (2).)

That may or may not be a terribly useful definition to you, but for the most part you can again use your instinctive understanding - a character is something like "the capital letter A", "the digit 1" etc.

There are other characters which are less obvious: combining characters such as "an acute accent", control characters such as "newline", and formatting characters (invisible, but affecting surrounding characters).

The important thing is that these are fundamentally "text" in some form or other. They have some meaning attached to them.

Now, unfortunately in the past, this distinction has been very blurred - C programmers are often used to thinking of "byte" and "char" as being interchangeable, to the extent that they will talk about reading a certain number of characters, even when the content is entirely binary. In modern environments such as .NET and Java, where the distinction is clear and present in the IO libraries, this can lead to people attempting to copy binary files by reading and writing characters, resulting in corrupt output.
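To make the corruption concrete, here is a minimal sketch of what happens when binary content is pushed through a text decoder and back. The class and method names are made up for illustration, and the four sample bytes are simply the start of a PNG signature.

```csharp
using System;
using System.Text;

public class BinaryAsTextDemo
{
    // Hypothetical helper: decodes the bytes as UTF-8 text and re-encodes them,
    // which is what copying a binary file through readers and writers amounts to.
    public static byte[] RoundTripThroughText(byte[] original)
    {
        string asText = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(asText);
    }

    public static void Main()
    {
        // 0x89 is not valid UTF-8 on its own, so the decoder replaces it.
        byte[] original = { 0x89, 0x50, 0x4E, 0x47 };
        byte[] copied = RoundTripThroughText(original);

        Console.WriteLine(BitConverter.ToString(original)); // 89-50-4E-47
        Console.WriteLine(BitConverter.ToString(copied));   // EF-BF-BD-50-4E-47
    }
}
```

The invalid byte is decoded as U+FFFD (the replacement character), which then encodes to three different bytes, so the "copy" no longer matches the original.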

Where does Unicode come in?

The Unicode Consortium is a body trying to standardise the handling of character data, including its transformation to and from binary form (otherwise known as encoding and decoding). There is also a set of ISO standards (10646 in various versions) which do similar things; Unicode and ISO 10646 can largely be regarded as "the same thing" in that they are compatible in almost all respects. (In theory ISO 10646 defines a larger potential set of characters, but this is never likely to become an issue.) Most modern computer languages and environments, such as .NET and Java, use Unicode for character representations.

Unicode defines, amongst other things, an abstract character repertoire (the set of characters it covers), a coded character set (a mapping from each character in the repertoire to a non-negative integer), some character encoding forms (mappings from the non-negative integers in the coded character set to sequences of "code units", e.g. bytes), and some character encoding schemes (mappings from sequences of code units into serialized byte sequences).

The difference between a character encoding form and a character encoding scheme is slightly subtle, but takes account of things like endianness. (For instance, the UCS-2 code unit 0xc2a9 may be serialized as 0xc2 0xa9 or 0xa9 0xc2, and it's the character encoding scheme that decides that.)
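As a small sketch of the endianness point, the two built-in UTF-16 encodings in .NET serialize the same 16-bit code unit in opposite byte orders. The class and helper names here are illustrative only.

```csharp
using System;
using System.Text;

public class EndiannessDemo
{
    // Hypothetical helper: encodes the text and formats the bytes as hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // A string holding the single 16-bit code unit 0xC2A9.
        string text = "\uc2a9";

        Console.WriteLine(Hex(Encoding.BigEndianUnicode, text)); // C2-A9
        Console.WriteLine(Hex(Encoding.Unicode, text));          // A9-C2
    }
}
```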

The Unicode abstract character repertoire can, in theory, hold up to 1114112 characters, although many are reserved as invalid and the rest aren't all likely ever to be assigned. Each character is coded as an integer between 0 and 1114111 (0x10ffff). For instance, capital A is coded as 65. Until a few years ago, it was hoped that only characters in the range 0 to 2^16-1 would be required, which would have meant that each character needed only 2 bytes to be represented. Unfortunately, more characters were needed, so surrogate pairs were introduced. They confuse things significantly (at least, they confuse me significantly) and most of the rest of this page will ignore their existence - I'll cover them briefly in the "Difficult bits" section.

What does .NET provide?

If all of this sounds rather confusing, don't worry. It's worth being aware of the distinctions above, but they don't often actually come to the fore. Most of the time you just want to convert some bytes into some characters, and vice versa. This is where the System.Text.Encoding class comes in, along with the System.Char structure (aka char in C#) and the System.String class (aka string in C#).

The char is the most basic character type. Each char is a single Unicode character (strictly speaking a single UTF-16 code unit, which only matters once surrogate pairs are involved). It takes 2 bytes in memory, and can take a value of 0-65535. Note that not all of those values are actually valid Unicode characters.

string is just a sequence of chars, fundamentally. It's immutable, which means that once you've created a string instance (however you've done it) you can't change it - the various methods in the string class which suggest that they're changing the string in fact just return a new string which is the original character sequence with the changes applied.
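A quick sketch of both points, the size of char and the immutability of string, using nothing beyond the base class library (the class name is made up):

```csharp
using System;

public class CharAndStringDemo
{
    public static void Main()
    {
        Console.WriteLine(sizeof(char));        // 2
        Console.WriteLine((int) char.MaxValue); // 65535

        string original = "hello";
        // ToUpperInvariant returns a new string; the original is left alone.
        string upper = original.ToUpperInvariant();

        Console.WriteLine(original); // hello
        Console.WriteLine(upper);    // HELLO
    }
}
```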

The System.Text.Encoding class provides facilities for converting arrays of bytes to arrays of characters, or strings, and vice versa. The class itself is abstract; various implementations are provided by .NET and can easily be instantiated, and users can write their own derived classes if they wish. (This is quite a rare requirement, however - most of the time you'll be fine with the built-in implementations.)

An encoding can also provide separate encoders and decoders, which maintain state between calls. This is necessary for multi-byte character encoding schemes, where you may not be able to decode all the bytes you have so far received from a stream. For instance, if a UTF-8 decoder receives 0x41 0xc2, it can return the first character (a capital A) but must wait for the third byte to determine what the second character is.
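The 0x41 0xc2 case can be sketched with a Decoder obtained from Encoding.UTF8. The class and the DecodeChunks helper are made up for illustration; the point is only that one decoder instance is reused across calls, so it can carry the pending byte over.

```csharp
using System;
using System.Text;

public class StatefulDecoderDemo
{
    // Hypothetical helper: feeds each chunk to one Decoder instance and
    // returns the text produced by each individual call.
    public static string[] DecodeChunks(Encoding encoding, params byte[][] chunks)
    {
        Decoder decoder = encoding.GetDecoder();
        string[] results = new string[chunks.Length];
        for (int i = 0; i < chunks.Length; i++)
        {
            byte[] chunk = chunks[i];
            char[] buffer = new char[encoding.GetMaxCharCount(chunk.Length)];
            // This overload does not flush, so an incomplete trailing
            // sequence is remembered until the next call.
            int charCount = decoder.GetChars(chunk, 0, chunk.Length, buffer, 0);
            results[i] = new string(buffer, 0, charCount);
        }
        return results;
    }

    public static void Main()
    {
        string[] parts = DecodeChunks(
            Encoding.UTF8,
            new byte[] { 0x41, 0xC2 },  // 'A' plus the first half of a 2-byte sequence
            new byte[] { 0xA9 });       // the second half

        Console.WriteLine(parts[0]);                            // A
        Console.WriteLine("U+{0:X4}", (int) parts[1][0]);       // U+00A9
    }
}
```

The first call yields only "A"; the copyright sign (U+00A9) appears once the second byte of its sequence arrives.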

Built-in encoding schemes

.NET provides various encoding schemes "out of the box". What follows below is a description (as far as I can find) of the various different encoding schemes, and how they can be retrieved.

ASCII

ASCII is one of the most commonly known and frequently misunderstood character encodings. Contrary to popular belief, it is only 7 bit - there are no ASCII characters above 127.

If anyone says that they wish to encode (for example) "ASCII 154" they may well not know exactly which encoding they actually mean.

If pressed, they're likely to say it's "extended ASCII". There is no encoding scheme called "extended ASCII".

There are many 8-bit encodings which are supersets of ASCII, and usually it is one of these which is meant - commonly whatever Windows Code Page is the default for their computer.

Every ASCII character has the same value in the ASCII encoding as in the Unicode coded character set - in other words, ASCII x is the same character as Unicode x for all characters within ASCII.

The .NET ASCIIEncoding class (an instance of which can be easily retrieved using the Encoding.ASCII property) was slightly odd in the early versions of the framework, in my view, as it appeared to encode by merely stripping away all bits above the bottom 7. This meant that, for instance, Unicode character 0xb5 ("micro sign") after encoding and decoding would become Unicode 0x35 ("digit five"), rather than some character showing that it was the result of encoding a character not contained within ASCII. (0xb5 is 1011 0101 in binary; clearing the top bit gives 0011 0101, which is 0x35.) From .NET 2.0 onwards, the default behaviour is instead to replace any character outside ASCII with a question mark (0x3f).
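On current versions of .NET, Encoding.ASCII substitutes a question mark for anything outside ASCII rather than stripping the top bit. The sketch below checks that, and also shows one way of asking for an exception instead of a silent substitution. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public class AsciiDemo
{
    // Hypothetical helper: encodes to ASCII and decodes back again.
    public static string RoundTrip(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    // Hypothetical helper: returns false if the text contains anything
    // that a strict ASCII encoding refuses to encode.
    public static bool StrictEncodeSucceeds(string text)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strictAscii.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+00B5 is the micro sign, which is not an ASCII character.
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("\u00b5"))); // 3F
        Console.WriteLine(RoundTrip("5\u00b5"));             // 5?
        Console.WriteLine(StrictEncodeSucceeds("5"));        // True
        Console.WriteLine(StrictEncodeSucceeds("5\u00b5"));  // False
    }
}
```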

UTF-8

UTF-8 is a good general-purpose way of representing Unicode characters. Each character is encoded as a sequence of 1-4 bytes. (All the characters < 65536 are encoded in 1-3 bytes; a character beyond that range, which .NET holds as a surrogate pair, is encoded as one sequence of 4 bytes rather than as two sequences of 3 bytes.) It can represent all characters, and it is "ASCII-compatible" in that any sequence of characters in the ASCII set is encoded in UTF-8 to exactly the same sequence of bytes as it would be in ASCII. In addition, the first byte is sufficient to say how many additional bytes (if any) are required for the whole character to be decoded.

UTF-8 itself needs no byte order mark (BOM), although one can be used as a way of giving evidence that the file is indeed in UTF-8 format. The UTF-8 encoded BOM is always 0xef 0xbb 0xbf.

Obtaining a UTF-8 encoding in .NET is simple - use the Encoding.UTF8 property. In fact, a lot of the time you don't even need to do that - many classes (such as StreamWriter) use UTF-8 by default when no encoding is specified. (Don't be misled by Encoding.Default - that's something else entirely!) I suggest always specifying the encoding however, just for the sake of readability.
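The byte counts are easy to check with Encoding.UTF8 itself. This is only a sketch with a made-up class name; the sample characters are arbitrary picks from each length band.

```csharp
using System;
using System.Text;

public class Utf8LengthDemo
{
    // Hypothetical helper: number of bytes the text occupies in UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("A"));          // 1 (U+0041)
        Console.WriteLine(Utf8ByteCount("\u00e9"));     // 2 (U+00E9, e with acute)
        Console.WriteLine(Utf8ByteCount("\u20ac"));     // 3 (U+20AC, euro sign)
        Console.WriteLine(Utf8ByteCount("\U0001D11E")); // 4 (U+1D11E, a surrogate pair in .NET)

        // The BOM that Encoding.UTF8 emits when asked for a preamble.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF
    }
}
```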

UTF-16 and UCS-2

UTF-16 is effectively how characters are maintained internally in .NET. Each character is encoded as a sequence of 2 bytes, other than characters which need a surrogate pair, which take 4 bytes. The ability to use surrogates is the only difference between UTF-16 and UCS-2 (also known as just "Unicode"), the latter of which can only represent characters 0-0xffff. UTF-16 can be big-endian, little-endian, or machine-dependent with an optional BOM (0xff 0xfe for little-endianness, and 0xfe 0xff for big-endianness). In .NET itself, I believe the surrogate issues are effectively forgotten, and each value in the surrogate pair is treated as an individual character, making UCS-2 and UTF-16 "the same" in a fuzzy sort of way. (The exact differences between UCS-2 and UTF-16 rely on deeper understanding of surrogates than I have, I'm afraid - if you need to know details of the differences, chances are you'll know more than I do anyway.)

A big-endian encoding may be retrieved using Encoding.BigEndianUnicode, and a little-endian encoding may be retrieved using Encoding.Unicode. Both are instances of System.Text.UnicodeEncoding, which can also be constructed directly with appropriate parameters for whether or not to emit the BOM and which endianness to use when encoding.

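As far as I can tell, when decoding binary content through a StreamReader with BOM detection enabled, a BOM in the content overrides the endianness of the encoding supplied, so the programmer doesn't need to do any extra work to decode appropriately if they either know the endianness or the content contains a BOM. Calling GetString directly on the encoding is a different matter: it doesn't appear to look for a BOM at all.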
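The BOM behaviour is easy to try out. The sketch below prints the two UTF-16 preambles and then decodes big-endian content two ways: through a StreamReader that was given the little-endian encoding but is allowed to detect a BOM, and through Encoding.Unicode.GetString directly. The class and helper names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public class Utf16BomDemo
{
    // Hypothetical helper: reads the bytes with a StreamReader that is told to
    // expect little-endian UTF-16 but may switch encoding if it finds a BOM.
    public static string ReadWithBomDetection(byte[] content)
    {
        using (StreamReader reader = new StreamReader(
            new MemoryStream(content), Encoding.Unicode, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        // Big-endian BOM followed by a big-endian capital A.
        byte[] bigEndianA = { 0xFE, 0xFF, 0x00, 0x41 };

        // The reader spots the BOM, switches to big-endian and drops the BOM.
        Console.WriteLine(ReadWithBomDetection(bigEndianA)); // A

        // GetString performs no detection: both code units come out byte-swapped.
        string raw = Encoding.Unicode.GetString(bigEndianA);
        Console.WriteLine("{0:X4} {1:X4}", (int) raw[0], (int) raw[1]); // FFFE 4100
    }
}
```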

UTF-7

UTF-7 is rarely used, in my experience, but encodes Unicode (possibly only the first 65535 characters) entirely into ASCII characters (not bytes!).

This can be useful for mail where the mail gateway may only support ASCII characters, or some subset of ASCII (in, for example, the EBCDIC encoding).

This description sounds fairly woolly for a reason: I haven't looked into it in any detail, and don't intend to.

If you need to use it, you'll probably understand it reasonably well anyway, and if you don't absolutely have to use it, I'd suggest steering clear.

An encoding instance in .NET can be retrieved using Encoding.UTF7.

Windows/ANSI Code Pages

Windows Code Pages are usually either single or double byte character sets, encoding up to 256 or 65536 characters respectively.

Each is numbered; an encoding for a known code page number can be retrieved using Encoding.GetEncoding(int).

Code pages are mostly useful for legacy data which is often stored in the "default code page".

An encoding for the default code page can be retrieved using Encoding.Default.

Again, I try to avoid using code pages where possible. More information is available on MSDN.

ISO-8859-1 (Latin-1)

Like ASCII, every character in Latin-1 has the same code there as in Unicode.

I haven't been able to ascertain for certain whether or not Latin-1 has a "hole" of undefined characters from 128 to 159, or whether it contains the same control characters there that Unicode does.

(I had begun to lean towards the "hole" idea, but Wikipedia disagrees, so I'm still sitting on the fence).

Latin-1 is also code page 28591, so obtaining an encoding for it is simple: Encoding.GetEncoding (28591).
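Whatever the standard itself says about 128 to 159, what .NET does with code page 28591 can be checked directly. This sketch (made-up class and method names) decodes every possible byte value and confirms that each one comes back as the Unicode character with the same number.

```csharp
using System;
using System.Text;

public class Latin1Demo
{
    // Hypothetical helper: true if byte value b decodes to the char with
    // code b for every b from 0 to 255.
    public static bool EveryByteMapsToSameCodePoint()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        byte[] allBytes = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            allBytes[i] = (byte) i;
        }

        string decoded = latin1.GetString(allBytes);
        if (decoded.Length != 256)
        {
            return false;
        }
        for (int i = 0; i < 256; i++)
        {
            if (decoded[i] != (char) i)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(EveryByteMapsToSameCodePoint()); // True

        // U+00E9 (e with acute) is a single byte, 0xE9, in Latin-1.
        byte[] encoded = Encoding.GetEncoding(28591).GetBytes("\u00e9");
        Console.WriteLine(BitConverter.ToString(encoded)); // E9
    }
}
```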

Streams, readers and writers

Streams are by their nature binary - they read and write bytes, fundamentally.

Anything which takes a string is going to do some kind of conversion to bytes, which may or may not be what you want.

The equivalents of streams for reading and writing text are System.IO.TextReader and System.IO.TextWriter respectively.

If you have a stream already, you can use System.IO.StreamReader (which derives from TextReader) and System.IO.StreamWriter (which derives from TextWriter) respectively, constructing them with the stream and the encoding you wish to use.

If you don't specify the encoding, UTF-8 is assumed.

Here is some example code to convert a file from UTF-8 to UCS-2:

using System;
using System.IO;
using System.Text;

public class FileConverter
{
    const int BufferSize = 8096;

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine
                ("Usage: FileConverter <input file> <output file>");
            return;
        }

        // Open a TextReader for the appropriate file
        using (TextReader input = new StreamReader
               (new FileStream (args[0], FileMode.Open),
                Encoding.UTF8))
        {
            // Open a TextWriter for the appropriate file
            using (TextWriter output = new StreamWriter
                   (new FileStream (args[1], FileMode.Create),
                    Encoding.Unicode))
            {
                // Create the buffer
                char[] buffer = new char[BufferSize];
                int len;

                // Repeatedly copy data until we've finished
                while ( (len = input.Read (buffer, 0, BufferSize)) > 0)
                {
                    output.Write (buffer, 0, len);
                }
            }
        }
    }
}

Note that this demonstrates using the constructors for StreamReader and StreamWriter which take streams. There are also constructors which take filenames as parameters, so that you don't have to manually open a FileStream in your code. Other parameters, such as the buffer size and whether or not to detect a BOM if present, are available - see the documentation for more details.

Finally, as of .NET 2.0 you should also look at the File class for all kinds of convenience methods.
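As a rough sketch of the File convenience methods, the same UTF-8 to UTF-16 conversion can be written in two calls. The class and method names are made up, and the demo uses temporary files so that it is self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public class FileConvenienceDemo
{
    // Hypothetical helper: reads the whole input file as UTF-8 and rewrites
    // it as little-endian UTF-16.
    public static void ConvertUtf8ToUtf16(string inputPath, string outputPath)
    {
        string text = File.ReadAllText(inputPath, Encoding.UTF8);
        File.WriteAllText(outputPath, text, Encoding.Unicode);
    }

    public static void Main()
    {
        string inputPath = Path.GetTempFileName();
        string outputPath = Path.GetTempFileName();
        try
        {
            // "caf" followed by U+00E9, as UTF-8 bytes without a BOM.
            File.WriteAllBytes(inputPath, new byte[] { 0x63, 0x61, 0x66, 0xC3, 0xA9 });

            ConvertUtf8ToUtf16(inputPath, outputPath);

            // FF-FE-63-00-61-00-66-00-E9-00 (BOM, then four UTF-16 code units)
            Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(outputPath)));
        }
        finally
        {
            File.Delete(inputPath);
            File.Delete(outputPath);
        }
    }
}
```

Unlike the buffered loop, this holds the whole file in memory as a string, so it suits small files rather than arbitrarily large ones.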

Difficult bits

Okay, so those are the basics of Unicode. There are then lots of extra bits, some of which have already been hinted at, and which people ought to be aware of, even if they deem them too unlikely to be relevant for their application to be worth sorting out. I don't offer any general techniques or guiding principles here - I'm just trying to raise some awareness. This is by no means an exhaustive list, either - these are just some of the nasty bits.

It's important to recognise that a lot of the difficulty here is in no way the fault of the Unicode Consortium - just as with dates and times and any number of other internationalisation problems, humanity has got itself into a fundamentally tricky situation over the course of its history.

Culture-sensitive searching and casing

These are covered in my article on .NET string handling.

Surrogate pairs

Now that Unicode has more than 65536 characters, it can't be represented in two bytes. This means that a .NET char value can't store all possible values. The solution UTF-16 uses is that of surrogate pairs: pairs of 16-bit values where each value is between 0xd800 and 0xdfff (a high surrogate in the range 0xd800-0xdbff followed by a low surrogate in the range 0xdc00-0xdfff). In other words, two "sort of" characters make one "real" character. (UCS-4 and UTF-32 get round this problem entirely by having wider values to start with - when everything's four bytes, you can get all possible characters in.)

This is basically a headache - it means that a string of 10 chars can actually represent anywhere between 5 and 10 "real" Unicode characters. Fortunately, most applications which don't involve scientific/mathematical notation and Han characters are unlikely to need to worry too much about them. Whether or not that applies to you is a different matter - and exactly which bits of your code are sensitive to surrogates will also vary between applications.
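To see a surrogate pair in practice, here is a short sketch using U+1D11E (musical symbol G clef), an arbitrary character outside the 16-bit range. CountCodePoints is a hypothetical helper written for this example, not a framework method.

```csharp
using System;

public class SurrogateDemo
{
    // Hypothetical helper: counts "real" Unicode characters by treating each
    // well-formed surrogate pair as one character.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low surrogate; the pair counts once
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                                   // 2
        Console.WriteLine("{0:X4} {1:X4}", (int) clef[0], (int) clef[1]); // D834 DD1E
        Console.WriteLine("{0:X}", char.ConvertToUtf32(clef, 0));         // 1D11E
        Console.WriteLine(CountCodePoints("A" + clef + "B"));             // 3
    }
}
```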

Combining characters

Not all characters should result in a single character being drawn on the screen. An accented character can be represented as the unaccented character followed by a combining accent character. Some GUI systems will support combining characters, some won't - and the impact on your application will depend on what assumptions you're making.

Normalization

Partly due to things like combining characters, there can be several ways of representing what is in some senses a single character.

Character sequences can be normalised to use combining characters wherever possible, or to avoid using combining characters wherever possible.

Should your application treat two different sequences representing the same actual character as different or the same?

Do any components you need rely on sequences being normalized in one particular way?
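.NET exposes normalization through string.Normalize. As a sketch (the class and helper names are made up), the two ways of writing an accented e compare as different strings until they are brought to the same normalization form:

```csharp
using System;
using System.Text;

public class NormalizationDemo
{
    // Hypothetical helper: compares two strings after normalizing both to
    // form C (composed characters wherever possible).
    public static bool EqualAfterNormalization(string first, string second)
    {
        return first.Normalize(NormalizationForm.FormC)
            == second.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string composed = "\u00e9";    // a single character, e with acute
        string decomposed = "e\u0301"; // e followed by a combining acute accent

        Console.WriteLine(composed == decomposed);                                   // False
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length);       // 2
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Length);     // 1
        Console.WriteLine(EqualAfterNormalization(composed, decomposed));            // True
    }
}
```

Form D goes the other way and uses combining characters wherever possible; which form to standardise on depends on what the rest of the system expects.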

Unicode explorer

It can be cumbersome to work out some of the details of this by hand, so you can use the little JavaScript-based tool on the original page to display useful information about any string you can enter into the text field. Currently I don't have any support for going the other way (e.g. from UTF-16 code units to text) but hopefully this is still useful.

Character   Unicode   UTF-16   UTF-8
5           U+0035    0035     35
6           U+0036    0036     36
7           U+0037    0037     37
8           U+0038    0038     38

This table breaks down the text in the text-box into Unicode characters. It does not perform any kind of normalization, so an accented character may appear as one character or more, depending on whether it is entered as a single character including the accent (e.g. é), or a non-accented character followed by combining characters (e.g. é - yes, that really is different to the previous example; copy and paste them both to see!). However, it does break the input into Unicode characters instead of just UTF-16 code units; a surrogate pair is treated as a single character.

For example,
