Ascii vs. Binary Files
Introduction
Most people classify files in two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII.
As an aside, ASCII text files seem very "American-centric". After all, the 'A' in ASCII stands for American. However, the US does seem to dominate the software market, and so effectively, it's an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. Of course, if someone really said "Come on, you don't really think those characters are saved as characters, do you? Don't you know about the ASCII code?", then you'd grudgingly agree that ASCII/text files are really stored as 0's and 1's.
But it's tough to think that way. ASCII files are really stored as 1's and 0's. But what does it mean to say that it's stored as 1's and 0's? Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
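The distinction above can be checked mechanically. The following is a minimal sketch in Python, assuming a helper named is_ascii (a name invented here for illustration): a sequence of bytes counts as ASCII if every byte has its most significant bit clear, that is, a value below 128.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte has its most significant bit clear (value < 128)."""
    return all(byte < 0x80 for byte in data)


text_bytes = b"cat\n"                     # only ASCII codes
binary_bytes = bytes([0x63, 0xA0, 0xDE])  # 0xA0 and 0xDE have the top bit set

print(is_ascii(text_bytes))    # True
print(is_ascii(binary_bytes))  # False
```

Note that the check only looks at bit patterns. A binary file could happen to contain nothing but bytes below 128 and would pass it.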
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are 0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
ASCII | 'c' | 'a' | 't' |
Hex | 63 | 61 | 74 |
Binary | 0110 0011 | 0110 0001 | 0111 0100 |
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuation, spaces, and so forth. I recall one time a student had used 100 asterisks in each of his comments, and these asterisks appeared everywhere. Each asterisk used up one byte in the file. We saved thousands of bytes from his files by removing comments, mostly the asterisks, which made the file look nice, but didn't add to the clarity.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
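As a rough illustration, the same lookup can be done in Python rather than with a printed ASCII table. The function name ascii_table is made up for this sketch; it returns each character of a string together with its code in hexadecimal and in binary.

```python
def ascii_table(text):
    """Return (character, hex code, 8-bit binary code) for each character."""
    rows = []
    for byte in text.encode("ascii"):  # the bytes a text editor would save
        rows.append((chr(byte), format(byte, "02x"), format(byte, "08b")))
    return rows


for row in ascii_table("cat"):
    print(*row)
# c 63 01100011
# a 61 01100001
# t 74 01110100
```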
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
What does that mean? I was once asked by a student what happens if the end of a line does not have a newline character. This student thought that files were saved in two dimensions (whether the student realized it or not). He didn't know that a file is saved as a one-dimensional array of bytes. He didn't realize that the newline character defines the end of a line. Without that newline character, you haven't reached the end of the line.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not that universally standard. It's common to use a single newline character on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return followed by line feed, written \r\n). Why two characters when only one is necessary?
This dates back to printers and teletypes. Carriage return moved the print head back to the beginning of the line, and line feed advanced the paper by one line. These were separate operations, and returning the print head was slow. So, two characters were placed in the file, which also gave the printer time to move the print head back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
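To make the difference concrete, here is a small Python sketch. The helper name to_unix_newlines is an assumption of this example, not a standard function. It shows that the Windows version of a two-line file is two bytes longer, and that the extra bytes are the carriage returns.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace each carriage return + line feed pair with a single line feed."""
    return data.replace(b"\r\n", b"\n")


windows_text = b"one\r\ntwo\r\n"  # two lines, each ended by \r\n
unix_text = to_unix_newlines(windows_text)

print(len(windows_text), len(unix_text))  # 10 8
print(unix_text)                          # b'one\ntwo\n'
```

A program that expects only \n and receives the unconverted file sees a stray \r (byte 0x0d) at the end of every line, which is one source of the "funny characters".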
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering 0x31 and 0x30 (49 and 48 in decimal). That is, you're entering 0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
"But how am I supposed to edit binary files?" you exclaim! Sometimes I see this dilemma. Students are told to perform a task. They try to do the task, and even though their solution makes no sense at all, they still do it. If asked to think about whether this solution really works, they might eventually reason that it's wrong, but then they'd ask "But how do I edit a binary file? How do I edit the individual bits?"
The answer is not simple. There are some programs that allow you to type in 49, and translate this to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. These programs are called hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs, and then converts it to a true binary file with the corresponding bit patterns.
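A short Python sketch shows the gap between what gets typed and what was intended. The helper name bits_to_byte is invented for this example. Typing the eight characters stores eight bytes, each one the ASCII code for '1' (0x31) or '0' (0x30), whereas the intended pattern is a single byte.

```python
def bits_to_byte(bits: str) -> bytes:
    """Interpret a string of eight '0'/'1' characters as one byte."""
    return bytes([int(bits, 2)])


typed = "11000011".encode("ascii")  # what a text editor actually saves
print(len(typed), typed.hex(" "))   # 8 31 31 30 30 30 30 31 31

single = bits_to_byte("11000011")   # the one byte that was intended
print(len(single), single.hex())    # 1 c3
```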
That is, it takes a file that looks like:
63 a0 de
and converts this ASCII file to a binary file that begins 0110 0011 (which is 63 in binary). Notice that this file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
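Such a converter might look like the following minimal Python sketch. The function name hex_text_to_bytes is an assumption of this example, and it expects well-formed hex pairs separated by whitespace, with no error handling.

```python
def hex_text_to_bytes(text: str) -> bytes:
    """Convert whitespace-separated hex pairs such as '63 a0 de' into raw bytes."""
    return bytes(int(pair, 16) for pair in text.split())


source = "63 a0 de"
binary = hex_text_to_bytes(source)

print(len(source.encode("ascii")), len(binary))  # 8 3
print(format(binary[0], "08b"))                  # 01100011
```

To produce an actual binary file, the result could be written out with a file opened in binary mode, for example open("out.bin", "wb"), where out.bin is a placeholder name.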
Viewing Binary Files
Most operating systems come with some program that allows you to view a file in "binary" format. However, reading 0's and 1's can be cumbersome, so these programs usually translate to hexadecimal. On Linux, hexdump and xxd are two such programs.
While most people prefer to view files through a text editor, you can only conveniently view ASCII files this way. Most text editors will let you look at a binary file (such as an executable), but they insert things that look like ^@ to indicate control characters.
A good hexdump will attempt to translate the hex pairs to printable ASCII if it can. This is interesting because you discover that in, say, executables, many parts of the file are still written in ASCII. So this is a very useful feature to have.
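The idea can be sketched in a few lines of Python. This is not how hexdump or xxd are implemented, and the output layout is only loosely modeled on theirs. Each output line shows an offset, the bytes as hex pairs, and the same bytes as printable ASCII, with a '.' standing in for anything unprintable.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of: offset, hex pairs, printable ASCII."""
    lines = []
    pad = width * 3 - 1  # width of a full row of hex pairs with single spaces
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(format(b, "02x") for b in chunk).ljust(pad)
        # Printable ASCII runs from 0x20 (space) to 0x7e (~); show '.' otherwise
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(format(offset, "08x") + "  " + hex_part + "  |" + text_part + "|")
    return "\n".join(lines)


print(hexdump(b"cat\n\x00\xa0\xde"))
```

For the seven bytes in the example, the text column reads cat.... because the newline, the zero byte, and the two bytes above 0x7e are not printable ASCII.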
Writing Binary Files, Part 2
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as a 32-bit unsigned binary integer, you can write it out using 4 bytes.
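The size comparison can be reproduced with Python's struct module. This sketch assumes a 32-bit unsigned integer in little-endian byte order (the "<I" format); other widths and byte orders are equally possible.

```python
import struct

number = 100000
as_text = str(number).encode("ascii")  # the six characters '1', '0', '0', '0', '0', '0'
as_binary = struct.pack("<I", number)  # 4-byte unsigned int, little-endian

print(len(as_text), as_text.hex(" "))      # 6 31 30 30 30 30 30
print(len(as_binary), as_binary.hex(" "))  # 4 a0 86 01 00
```

The saving grows with the size of the number: any value up to 4294967295 still fits in the same 4 bytes, while its decimal text can take up to 10.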
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as they appear in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.
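The text describes this in terms of C-style write() and read() calls on raw memory. Python has no direct equivalent, so the sketch below swaps in the struct module, which packs a fixed set of fields into bytes and unpacks them again. The record layout (a 32-bit int, a 16-bit unsigned int, and a 64-bit float) and the names RECORD, save_record, and load_record are all made up for illustration.

```python
import os
import struct
import tempfile

# Hypothetical record: int32 id, uint16 count, float64 value, little-endian, no padding
RECORD = struct.Struct("<iHd")


def save_record(path, record_id, count, value):
    """Write the three fields to a file as raw bytes."""
    with open(path, "wb") as f:
        f.write(RECORD.pack(record_id, count, value))


def load_record(path):
    """Read the bytes back and recover the three fields as a tuple."""
    with open(path, "rb") as f:
        return RECORD.unpack(f.read(RECORD.size))


path = os.path.join(tempfile.mkdtemp(), "record.bin")
save_record(path, 7, 3, 2.5)

print(os.path.getsize(path))  # 14
print(load_record(path))      # (7, 3, 2.5)
```

Here RECORD.size plays the role that sizeof plays in the description above: it is the number of bytes that must be written and read back.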
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
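Both problems can be seen with Python's struct module, used here only as a convenient way to inspect byte layouts. The value 0x0A0B0C0D and the "char followed by int" layout are arbitrary examples.

```python
import struct

value = 0x0A0B0C0D
little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(little.hex(" "))  # 0d 0c 0b 0a
print(big.hex(" "))     # 0a 0b 0c 0d

# A 1-byte char followed by a 4-byte int:
packed_size = struct.calcsize("<bi")  # standard sizes, no padding: 1 + 4 = 5
native_size = struct.calcsize("@bi")  # native alignment; often 8, but platform-dependent
print(packed_size, native_size)
```

The same four-byte integer produces two different files depending on byte order, and the same two fields can occupy a different number of bytes depending on whether padding is inserted to align the int.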
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
Summary
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8 bits. Each byte of a binary file can have any of the full 256 bitstring patterns (as opposed to an ASCII file, which only has 128 bitstring patterns).
There may be a time when Unicode text files become more prevalent. But for now, ASCII files are the standard format for text files.