Ascii vs. Binary Files
Ascii vs. Binary Files
Introduction
Most people classify files in two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII.
As an aside, ASCII text files seem very "American-centric". After all, the 'A' in ASCII stands for American. However, the US does seem to dominate the software market, and so effectively, it's an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. Of course, if someone really said "Come on, you don't really think those characters are saved as characters, do you? Don't you know about the ASCII code?", then you'd grudgingly agree that ASCII/text files are really stored as 0's and 1's.
But it's tough to think that way. ASCII files are really stored as 1's and 0's. But what does it mean to say that it's stored as 1's and 0's? Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover the ASCII code for 0x63, 0x61, 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
| ASCII | 'c' | 'a' | 't' |
| Hex | 63 | 61 | 74 |
| Binary | 0110 0011 | 0110 0001 | 0111 1000 |
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuations, spaces, and so forth. I recall one time a student has used 100 asterisks in his comments, and these asterisks appeared everywhere. Each asterisk used up one byte on the file. We saved thousands of bytes from his files by removing comments, mostly the asterisks, which made the file look nice, but didn't add to the clarity.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
What does that mean? I was once asked by a student, what happens if the end of line does not have a newline character. This student thought that files were saved as two-dimensions (whether the student realized ir or not). He didn't know that it was saved as a one dimensional array. He didn't realize that the newline character defines the end of line. Without that newline character, you haven't reached the end of line.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return, newline, which is \r and \n, I believe). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011 |
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering in 0x49 and 0x48. That is, you're entering in 0100 1001and 0100 1000 into the files. You're actually (indirectly) typing 8 bits at a time.
"But, how am I suppose to edit binary files?", you exclaim! Sometimes I see this dilemma. Students are told to perform a task. They try to do the task, and even though their solution makes no sense at all, they still do it. If asked to think about whether this solution really works, they might eventually reason that it's wrong, but then they'd ask "But how do I edit a binary file? How do I edit the individual bits?"
The answer is not simple. There are some programs that allow you type in 49, and it translates this to a single byte, 0100 1001, instead of the ASCII code for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs, but then converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de |
and converts this ASCII file to a binary file that begins 0110 0011 (which is 63 in binary). Notice that this file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
Viewing Binary Files
Most operating systems come with some program that allows you to view a file in "binary" format. However, reading 0's and 1's can be cumbersome, so they usually translate to hexadecimal. There are programs called hexdump which come with the Linux distribution or xxd.
While most people prefer to view files through a text editor, you can only conveniently view ASCII files this way. Most text editors will let you look at a binary file (such as an executable), but insert in things that look like ^@ to indicate control characters.
A good hexdump will attempt to translate the hex pairs to printable ASCII if it can. This is interesting because you discover that in, say, executables, many parts of the file are still written in ASCII. So this is a very useful feature to have.
Writing Binary Files, Part 2
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as it appears in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
Summary
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8-bits. Each byte of a binary file can have the full 256 bitstring patterns (as opposed to an ASCII file which only has 128 bitstring patterns).
There may be a time where Unicode text files becomes more prevalent. But for now, ASCII files are the standard format for text files.
Ascii vs. Binary Files的更多相关文章
- How can I read binary files from Resources
How can I read binary files from Resourceshttp://answers.unity3d.com/questions/8187/how-can-i-read-b ...
- text files and binary files
https://en.wikipedia.org/wiki/Text_file https://zh.wikipedia.org/wiki/文本文件
- mysql(或者mariadb)连接工具HeidiSQL
Some infos around HeidiSQL Project website: http://www.heidisql.com/Google Code: http://code.google. ...
- Linux学习笔记:ftp中binary二进制与ascii传输模式的区别
在使用ftp传输文件时,常添加上一句: binary -- 使用二进制模式传输文件 遂查资料,如下所获. FTP可用多种格式传输文件,通常由系统决定,大多数Linux/UNIX系统只有两种模式:文本 ...
- Linux——grep binary file
原创声明:本文系博主原创文章,转载或引用请注明出处. grep命令是linux下常用的文本查找命令.当grep检索的文件是二进制文件时,grep命令会提示: $grep pattern filenam ...
- Debugging Information in Separate Files
[Debugging Information in Separate Files] gdb allows you to put a program's debugging information in ...
- ftp二进制与ascii传输方式区别
ASCII 和BINARY模式区别: 用HTML 和文本编写的文件必须用ASCII模式上传,用BINARY模式上传会破坏文件,导致文件执行出错. BINARY模式用来传送可执行文件,压缩文 ...
- rpmbuild spec 打包jar变小了、设置禁止压缩二进制文件Disable Binary stripping in rpmbuild
Disable Binary stripping in rpmbuild 摘自:http://livecipher.blogspot.com/2012/06/disable-binary-stripp ...
- reading/writing files in Python
file types: plaintext files, such as .txt .py Binary files, such as .docx, .pdf, iamges, spreadsheet ...
随机推荐
- faster-rcnn 论文讲解
Faster RCN已经将特征抽取(feature extraction),proposal提取,bounding box regression(rect refine),classification ...
- Linux——软件包简单学习笔记
Linux中的是那种软件包: (这里学习是基于redHat的Cent-OS) 1: 二进制软件包管理(RPM.YUM) 2:源代码包安装 3: 脚本安装(Shell或Java脚本) 一: 二进制软件 ...
- MVC动态二级域名解析
动态二级域名的实现: 应用场景:目前产品要实现block功能,因为工作需要实现二级域名:www.{CompanyUrl}.xxx.com 假设产品主域名入口为:www.xxx.com 公司员工管理:w ...
- 《F4+2团队项目需求改进与系统设计》
任务一 a.分析<动态的太阳系模型项目需求规格说明书>初稿的不足. 任务概述描述的有些不具体,功能的规定不详细,在此次作业进行了修改. b.参考<构建之法>8.5节功能的定位和 ...
- 《剑指offer》第十题(斐波那契数列)
// 面试题:斐波那契数列 // 题目:写一个函数,输入n,求斐波那契(Fibonacci)数列的第n项. #include <iostream> using namespace std; ...
- jq expando && $.data()
1.使用隐藏控件或者是js全局变量来临时存储数据,全局变量容易导致命名污染,隐藏控件导致经常读写dom浪费性能 jQuery提供了自己的数据缓存方案,使用jQuery数据缓存方案,我们需要掌握$.da ...
- 【Golang】解决Go test执行单个测试文件提示未定义问题
背景 很多人记录过怎么执行Go test单个文件或者单个函数,但是要么对执行单文件用例存在函数或变量引用的场景避而不谈,要么提示调用了其它文件中的模块会报错.其实了解了go test命令的机制之后,这 ...
- Java 常用对象-BigInteger类
2017-11-02 21:57:09 BigInteger类:不可变的任意精度的整数.所有操作中,都以二进制补码形式表示 BigInteger(如 Java 的基本整数类型).BigInteger ...
- SQLSERVER 对于非dbo的表增加注释
平时我们创建表的时候总是dbo.imsi_collect_state,但是有时候为了方便管理我们可能会创建架构wifi,那么表名就是wifi.imsi_collect_state 原来增加注释的方式是 ...
- C#中类的序列化和反序列化
说明:本文演示将类序列化后写入记事本并从记事本读取反序列化为对象1.首先创建一个类,同时类必须标识为Serializable,如下: [Serializable] public class Regio ...