今天看到这篇文章。颇为震撼。感叹算法之“神通”。

借助于合适的算法能够完毕看似不可能的事情。

最早这个问题是在Stack Overflow站点上面给出的(Sorting numbers in RAM):

题目:

提供一个1M的ROM和1M的RAM,一个输入流和一个输出流。

程序代码终于烧录在1M的ROM中,程序能够使用1M的RAM进行运算。输入流中依次输入100万个8位的整数。要求输出流中输出这100万个数排序后的结果。

已经能够搜索到非常多解法了,今天看到一个国外的程序猿的分析,认为非常有趣,想把他的分析过程简单转在这里。简单一看,根本不可能。100万个8位数不管怎样也不能在1M的内存里装下。

排序过程是利用归并排序。效率较高。最难的地方是在怎样将排序后的100万个数字存下来。换一种思路,不一定要存下每个数字本身,由于数字都已经排序了。那么相邻两个数字之间的差值是非常小的,假设在极端的情况下,两个数字之间的差值非常大,那么必定会有很多其它量的相邻数字之间的差值更小,因此存下全部的100万个数字的差值须要的空间是能够估算的。

平均每个差值的大小为:10^8/100万 = 100,100须要7个bit位来表示,因此共须要100万×7 = 875, 000 字节,不到1M的空间,可是还有非常多大于128的数值须要编码,这样有一些数值的编码大于7个bit。因此,接下来的问题是怎样编码这100万个差值,能尽可能压缩空间,作者举出了算数编码来解决这一问题。选择一种简单的编码规则,即:看第一个bit位,假设是0,则后6个bit位表示数值。假设是1,则表示差值为64,继续读取后一位,假设仍然为1,则差值继续累加1。直到读到0,然后读取后面的6个bit,这样能够表示全部可能出现的值,这样下来,终于计算出所须要的内存为1070312.5bytes,仍然大于1M。

终于採用了一个针对该问题的哈夫曼编码解决。算数编码看懂了,哈夫曼编码也似乎看明确了,可是如何运用到解决本问题中,还不是太明确。

同一时候作者也给出了339行的解决这个问题的实战代码。

以下是详细的解决的方法,本人的翻译能力有限。无法将原文的意思全然的表达,希望读者自己去理解当中的意思。

整体来说,这样的思路绝对是一种超脱俗,绝对的不可思议。

Arithmetic Coding and the 1MB Sorting Problem

It’s been two weeks since the 1MB Sorting Problem was originally featured on Reddit, and in case you don’t think this
artificial problem has been thoroughly stomped into the ground yet, here’s a continuation of last post’s explanation of the working C++ program which solves it.

In that post, I gave a high-level outline of the approach, and showed an encoding scheme – which I’ve since learned is Golomb coding – which comes close
to meeting the memory requirements, but doesn’t quite fit. Arithmetic coding, on the other hand, does fit. It’s interesting, because this problem, as it was phrased, almost seems designed to force you into arithmetic coding (though Nick
Cleaton’s solution
 manages to avoid it).

I had read about arithmetic coding before, but I never had any reason to sit down to implement it. It always struck me as kind of mystical: How the heck you encode information in a fraction of
a bit, anyway? The 1MB Sorting Problem turned out to be a great excuse to learn how arithmetic coding works.

How Many Sorted Sequences Even Exist?

It’s important to note that the whole reason why we are able to represent a sorted sequence
of one million 8-digit numbers in less than 1 MB of memory is because mathematically, there simply aren’t that many different sorted sequences which exist.

To see this, consider the following method of encoding a sorted sequence as a row of boxes, read from left to right.

An empty box tells us to increment the current value (initially 0), and a box with a dot inside it tells us to return the current value as part of the sorted sequence. For example, the above row of boxes corresponds to the sorted sequence {
3, 7, 7, 10, 15, 16, … }.

Now, to encode one million 8-digit numbers, we’d need exactly 1000000 boxes containing dots, plus a maximum of 99999999 empty boxes. In fact, when there are exactly 99999999 + 1000000 boxes in total, we can encode every
possible sorted sequence using a unique distribution of dots. The question then becomes: How many different ways are there to distribute those dots? This is a straightforward number
of combinations
problem:

(99999999+10000001000000)≈28093729.481...

That’s a lot of combinations. Now, think of the contents of memory as one giant label which represents exactlyone of
those combinations. The exponent on the 2, above, gives us a lower limit for how many bits of memory would be required to come up with a unique label for every possible combination. In this case, it can’t be done using fewer than 8093730 bits, or 1011717 bytes.

That’s the fundamental limit. No encoding scheme can ever do better than that; it would be like trying to uniquely label every state in the USA using fewer than 6 bits. On the bright side, 1011717 bytes is comfortably less than our 1048576 byte limit, which
is encouraging.

The Probability of Encountering Each Delta Value

In the last post, we saw the potential of encoding delta
values – the differences between numbers in a sorted sequence. Thinking in terms of the above rows of boxes, let’s take a look at the probability of encountering each delta value.

Since we know that there are 99999999 empty boxes, and 1000000 boxes containing dots, the probability of any particular box containing a dot is just:

D=100000099999999+1000000≈.00990099019703...

For simplicity, let’s now imagine an infinite row of boxes, with dots occurring at the same frequency as this. The probability of encountering a delta value of 0 is,
then, the same as the probability of a box containing a dot, which is just D.

P(0)=D

How about a delta value of 1? Well, the probability of the first box being empty is (1 - D), while the probability of the second box containing a dot is still just D. Since each outcome is an independent
event
, we can multiply those probabilities together. The probability of encountering a delta value of 1,
then, is

P(1)=(1−D)×D≈.00980296059015...

And in general, the probability encountering a delta value of N is

P(N)=(1−D)N×D

Now, let’s draw the real number line in the interval [0, 1), and let’s subdivide it into partitions according to the probabilities of each delta value. They begin quite small – you can see the first three partitions for delta values0, 1 and 2 squished
all the way to the left – and they get infintessimally smaller as we proceed to the right, as larger delta values are exponentially less likely to occur.

If you were to throw a dart at this number line, the likelihood of hitting each partition is about the same as the likelihood of encountering each delta value in one of our sorted sequences.

That’s exactly the kind of information that’s useful for arithmetic coding.

The Idea Behind Arithmetic Encoding

Arithmetic encoding is able to encode a sequence of elements – in this case, delta values – by progressively subdividing the real number line into finer and finer partitions.
At each step, the relative width of each partition is determined by the probability of encountering each element.

Suppose the first delta value in the sequence is 27. We begin by locating the corresponding
partition in the original number line, and zooming into it. This gives us a new interval of
the real number line to work with – in this case, roughly from .236 to .243 – which we can then subdivide further. Let’s use the same proportions we used for the first element.

Suppose the next element in the sequence is 39. Again, we locate the corresponding partition
and zoom in, subdividing the interval into even finer partitions.

In this way, the interval gets progressively smaller and smaller. We repeat these steps one million times: once for each element in the sequence. After that, all we need to store is a single real value which lies somewhere within the final partition. This value
will unambiguously identify the entire one-million-element sequence.

As you can imagine, to represent this value, it’s going to take a lot of precision. Hundreds of thousands times more precision than you’ll find in any single- or even double-precision floating-point value. What we need is a way to represent a fractional value
having millions of significant digits. And in arithmetic coding, that’s exactly what the final
encoded bit stream is. It’s one giant binary fraction having millions of binary digits, pinpointing a specific value somewhere within the interval [0, 1) with laser precision.

That’s great, you might be thinking, but how the heck do we even work with numbers that precise?

This post has already become quite long, so I’ll answer that question in a separate post. You don’t even have to wait, because it’s already published: See Arithmetic
Encoding Using Fixed-Point Math
 for the thrilling conclusion!

以上内容来源:http://preshing.com/20121105/arithmetic-coding-and-the-1mb-sorting-problem/

怎样使用1M的内存排序100万个8位数的更多相关文章

  1. 100万并发连接服务器笔记之Java Netty处理1M连接会怎么样

    前言 每一种该语言在某些极限情况下的表现一般都不太一样,那么我常用的Java语言,在达到100万个并发连接情况下,会怎么样呢,有些好奇,更有些期盼.这次使用经常使用的顺手的netty NIO框架(ne ...

  2. SQLServer如何快速生成100万条不重复的随机8位数字

    最近在论坛看到有人问,如何快速生成100万不重复的8位编号,对于这个问题,有几点是需要注意的: 1.    如何生成8位随机数,生成的数越随机,重复的可能性当然越小 2.    控制不重复 3.    ...

  3. Stackful 协程库 libgo(单机100万协程)

    libgo 是一个使用 C++ 编写的协作式调度的stackful协程库, 同时也是一个强大的并行编程库. 设计之初是为高并发分布式Linux服务端程序开发提供底层框架支持,可以让链接进程序的同步的第 ...

  4. 极限挑战—C#100万条数据导入SQL SERVER数据库仅用4秒 (附源码)

    原文:极限挑战-C#100万条数据导入SQL SERVER数据库仅用4秒 (附源码) 实际工作中有时候需要把大量数据导入数据库,然后用于各种程序计算,本实验将使用5中方法完成这个过程,并详细记录各种方 ...

  5. Qt中提高sqlite的读写速度(使用事务一次性写入100万条数据)

    SQLite数据库本质上来讲就是一个磁盘上的文件,所以一切的数据库操作其实都会转化为对文件的操作,而频繁的文件操作将会是一个很好时的过程,会极大地影响数据库存取的速度.例如:向数据库中插入100万条数 ...

  6. C#100万条数据导入SQL SERVER数据库仅用4秒 (附源码)

    作者: Aicken(李鸣)  来源: 博客园  发布时间: 2010-09-08 15:00  阅读: 4520 次  推荐: 0                   原文链接   [收藏] 摘要: ...

  7. python 统计MySQL大于100万的表

    一.需求分析 线上的MySQL服务器,最近有很多慢查询.需要统计出行数大于100万的表,进行统一优化. 需要筛选出符合条件的表,统计到excel中,格式如下: 库名 表名 行数 db1 users 1 ...

  8. Netty 100万级到亿级流量 高并发 仿微信 IM后台 开源项目实战

    目录 写在前面 亿级流量IM的应用场景 十万级 单体IM 系统 高并发分布式IM系统架构 疯狂创客圈 Java 分布式聊天室[ 亿级流量]实战系列之 -10[ 博客园 总入口 ] 写在前面 ​ 大家好 ...

  9. Netty 100万级高并发服务器配置

    前言 每一种该语言在某些极限情况下的表现一般都不太一样,那么我常用的Java语言,在达到100万个并发连接情况下,会怎么样呢,有些好奇,更有些期盼. 这次使用经常使用的顺手的netty NIO框架(n ...

随机推荐

  1. perl学习之子程序

    一.定义子程序即执行一个特殊任务的一段分离的代码,它可以使减少重复代码且使程序易读.PERL中,子程序可以出现在程序的任何地方.定义方法为:sub subroutine{statements;}二.调 ...

  2. [转载] Python数据类型知识点全解

    [转载] Python数据类型知识点全解 1.字符串 字符串常用功能 name = 'derek' print(name.capitalize()) #首字母大写 Derek print(name.c ...

  3. python的部分内置函数

    内置函数思维导图:https://www.processon.com/mindmap/5c10ca52e4b0c2ee256ac034 内置函数 匿名函数 匿名函数统一的名字是:<lambda& ...

  4. uboot的Makefile裁剪(针对飞思卡尔的mx6系列)

    VERSION = 2009PATCHLEVEL = 08SUBLEVEL =EXTRAVERSION =ifneq "$(SUBLEVEL)" ""U_BOO ...

  5. LeetCode(113) Path Sum II

    题目 Given a binary tree and a sum, find all root-to-leaf paths where each path's sum equals the given ...

  6. fdisk分区自动挂载

    理解/etc/fstab文件配置 首先打开这个文件我们查看下本身内容 vi /etc/fstab   或者   vim /etc/fstab 2 介绍下fstab配置 文件配置每一行属于一个配置,每个 ...

  7. iOS学习笔记13-网络(二)NSURLSession

    在2013年WWDC上苹果揭开了NSURLSession的面纱,将它作为NSURLConnection的继任者.现在使用最广泛的第三方网络框架:AFNetworking.SDWebImage等等都使用 ...

  8. 【Luogu】P1681最大正方形2(异或运算,DP)

    题目链接 不得不说attack是个天才.读入使用异或运算,令que[i][j]^=(i^j)&1,于是原题目变成了求que数组的最大相同值. 然而我还是不理解为啥,而且就算简化成这样我也不会做 ...

  9. java面试题之Thread的run()和start()方法有什么区别

    run()方法: 是在主线程中执行方法,和调用普通方法一样:(按顺序执行,同步执行) start()方法: 是创建了新的线程,在新的线程中执行:(异步执行) public class App { pu ...

  10. virtualbox中centos虚拟机网络配置

    本文讲述的是如何在Oracle VM VirtualBox安装的CentOS虚拟机中进行网络配置,使得虚拟机可以访问宿主主机,也能访问外网,宿主主机可以访问虚拟机,虚拟机之间也可以相互访问. 在Vir ...