从无重复大数组找TOP N元素的最优解说起

有一类面试题，既可以考察工程师算法、也可以兼顾实践应用、甚至创新思维，这些题目便是好的题目，有区分度表现为可以有一般解，也可以有最优解。最近就发现了一个这样的好题目，拿出来晒一晒。

1 题目

原文：

There is an array of 10000000 different int numbers. Find out its largest 100 elements. The implementation should be optimized for executing speed.

翻译：

有一个长度为1000万的int数组，各元素互不重复。如何以最快的速度找出其中最大的100个元素？

2 分析与解

（接下来的算法均以Java语言实现。）

首先，第一个冒出来的想法是——排序。各种排序算法对数组进行一次sort，然后limit出max的100个即可，时间复杂度为O(nLogN)。

2.1 堆排序思路

我以堆排序来实现这个题目，这样可以使用非常少的内存空间，始终维护一个100个元素大小的最小堆，堆顶int[0]即是100个元素中最小的，插入一个新的元素的时候，将这个元素和堆顶int[0]进行交换，也就是淘汰掉堆顶，然后再维护一个最小堆，使int[0]再次存储最小的元素，循环往复，不断迭代，最终剩下的100个元素就是结果，该算法时间复杂度仍然是O(nLogN)，优点在于节省内存空间，算法时间复杂度比较理想，平均耗时400ms。

代码实现如下，

import java.util.ArrayList;

import java.util.Collections;

import java.util.List;

/**

 * Implementation of finding top 100 elements out of a huge int array. <br>

 *

 * There is an array of 10000000 different int numbers. Find out its largest 100

 * elements. The implementation should be optimized for executing speed. <br>

 *

 * Note: This is the third version of implementation, this time I make the best out

 * of the heap sort algorithm by using a minimum heap. The heap maintains the top biggest

 * numbers that guarantees the minimum number is removed every time a new number is added

 * to the heap. It saves memory usage to the limit by just using an array which size is 101

 * and a few temp elements. However, the performance is not as good as the bit map way but

 * better than the multiple thread way.

 *

 * @author zhangxu04

 */

public class FindTopElements3 {

	private static final int ARRAY_LENGTH = 10000000; // big array length

	public static void main(String[] args) {

		FindTopElements3 fte = new FindTopElements3();

		// Get a array which is not in order and elements are not duplicate

		int[] array = getShuffledArray(ARRAY_LENGTH);

		// Find top 100 elements and print them by desc order in the console

		long start = System.currentTimeMillis();

		fte.findTop100(array);

		long end = System.currentTimeMillis();

		System.out.println("Costs " + (end - start) + "ms");

	}

	public void findTop100(int[] arr) {

		MinimumHeap heap = new MinimumHeap(100);

		for (Integer i : arr) {

			heap.add(i);

			if (heap.size() > 100) {

				heap.deleteTop();

			}

		}

		for (int i = 0; i < 100; i++) {

			System.out.println(heap.deleteTop());

		}

	}

	/**

	 * Get shuffled int array

	 *

	 * @return array not in order and elements are not duplicate

	 */

	private static int[] getShuffledArray(int len) {

		System.out

				.println("Start to generate test array... this may take several seconds.");

		List<Integer> list = new ArrayList<Integer>(len);

		for (int i = 0; i < len; i++) {

			list.add(i);

		}

		Collections.shuffle(list);

		int[] ret = new int[len];

		for (int i = 0; i < len; i++) {

			ret[i] = list.get(i);

		}

		return ret;

	}

}

class MinimumHeap {

	int[] items;

	int size;

	public MinimumHeap(int size) {

		items = new int[size + 1];

		size = 0;

	}

	void shiftUp(int index) {

		int intent = items[index];

		while (index > 0) {

			int pindex = (index - 1) / 2;

			int parent = items[pindex];

			if (intent < parent) {

				items[index] = parent;

				index = pindex;

			} else {

				break;

			}

		}

		items[index] = intent;

	}

	void shiftDown(int index) {

		int intent = items[index];

		int leftIndex = 2 * index + 1;

		while (leftIndex < size) {

			int minChild = items[leftIndex];

			int minIndex = leftIndex;

			int rightIndex = leftIndex + 1;

			if (rightIndex < size) {

				int rightChild = items[rightIndex];

				if (rightChild < minChild) {

					minChild = rightChild;

					minIndex = rightIndex;

				}

			}

			if (minChild < intent) {

				items[index] = minChild;

				index = minIndex;

				leftIndex = index * 2 + 1;

			} else {

				break;

			}

		}

		items[index] = intent;

	}

	public void add(int item) {

		items[size++] = item;

		shiftUp(size - 1);

	}

	public int deleteTop() {

		if (size < 1) {

			return 0;

		}

		int maxItem = items[0];

		int lastItem = items[size - 1];

		size--;

		if (size < 1) {

			return lastItem;

		}

		items[0] = lastItem;

		shiftDown(0);

		return maxItem;

	}

	public boolean isEmpty() {

		return size < 1;

	}

	public int size() {

		return size;

	}

	/**

	 * MinimumHeap main test

	 * @param args

	 */

	public static void main(String[] args) {

		MinimumHeap heap = new MinimumHeap(7);

		heap.add(2);

		heap.add(3);

		heap.add(5);

		heap.add(1);

		heap.add(4);

		heap.add(7);

		heap.add(6);

		heap.deleteTop();

		heap.deleteTop();

		while (!heap.isEmpty()) {

			System.out.println(heap.deleteTop());

		}

	}

}

那么挖掘下题目，两个点是我们的优化线索：

1、元素互不重复

2、最快的速度，没有提及对于系统资源以及空间的要求

2.2 多线程分而治之策略

顺着#2条线索，可以给出一个多线程的优化版本，使用分而治之的策略，将1000万大小的数组分割为1000个元素组成的若干小数组，利用JDK自带的高效排序算法void java.util.Arrays.sort(int[] a)来进行排序，多线程处理，主线程汇总结果后取出各个小数组的top 100，归并后再进行一次排序得出结果。

代码实现如下，

import java.util.ArrayList;

import java.util.Arrays;

import java.util.Collections;

import java.util.List;

import java.util.concurrent.Callable;

import java.util.concurrent.CompletionService;

import java.util.concurrent.ExecutorCompletionService;

import java.util.concurrent.ExecutorService;

import java.util.concurrent.Executors;

/**

 * Implementation of finding top 100 elements out of a huge int array. <br>

 *

 * There is an array of 10000000 different int numbers.

 * Find out its largest 100 elements.

 * The implementation should be optimized for executing speed.

 *

 * @author zhangxu04

 */

public class FindTopElements {

	private static final int ARRAY_LENGTH = 10000000; // big array length

	private static final int ELEMENT_NUM_PER_GROUP = 10000; // split big array into sub-array, this represents sub-array length

	private static final int TOP_ELEMENTS_NUM = 100; // top elements number

	private ExecutorService executorService;

	private CompletionService<int[]> completionService;

	public FindTopElements() {

		int MAX_THREAD_COUNT = 50;

		executorService = Executors.newFixedThreadPool(MAX_THREAD_COUNT);

		completionService = new ExecutorCompletionService<int[]>(executorService);

	}

	/**

	 * Start from here :-)

	 * @param args

	 */

	public static void main(String[] args) {

		FindTopElements findTopElements = new FindTopElements();

		// Get a array which is not in order and elements are not duplicate

		int[] array = getShuffledArray(ARRAY_LENGTH);

		// Find top 100 elements and print them by desc order in the console

		long start = System.currentTimeMillis();

		findTopElements.findTop100(array);

		long end = System.currentTimeMillis();

		System.out.println("Costs " + (end - start) + "ms");

	}

	/**

	 * Leveraging concurrent components of JDK, we can deal small parts of the huge array concurrently.

	 * The huge array are split into several sub arrays which are submitted to a thread pool one by one.

	 * By using <code>CompletionService</code>, we can take out completed result from the pool as soon as possible,

	 * which avoid the block issue when getting future result through a future task list by using

	 * <code>ExcutorService</code> and <code>Future</code> class. Moreover, the can optimize the performance of

	 * the piece of code by processing the completed results once we get them, so the overall sort invocation will

	 * not be delayed to the final moment.

	 *

	 */

	private void findTop100(int[] arr) {

		System.out.println("Start to compute.");

		int groupNum = (ARRAY_LENGTH / ELEMENT_NUM_PER_GROUP);

		System.out.println("Split " + ARRAY_LENGTH + " elements into " + groupNum + " groups");

		for (int i = 0; i < groupNum; i++) {

			int[] toBeSortArray = new int[ELEMENT_NUM_PER_GROUP];

			System.arraycopy(arr, i * ELEMENT_NUM_PER_GROUP, toBeSortArray, 0, ELEMENT_NUM_PER_GROUP);

			completionService.submit(new FindTop100(toBeSortArray));

		}

		try {

			int[] overallArray = new int[TOP_ELEMENTS_NUM * groupNum];

			for (int i = 0; i < groupNum; i++) {

				System.arraycopy(completionService.take().get(), 0, overallArray, i * TOP_ELEMENTS_NUM, TOP_ELEMENTS_NUM);

			}

			Arrays.sort(overallArray);

			for (int i = 1; i <= TOP_ELEMENTS_NUM; i++) {

				System.out.println(overallArray[TOP_ELEMENTS_NUM * groupNum - i]);

			}

			System.out.println("Finish to output result.");

		} catch (Exception e) {

			e.printStackTrace();

		}

		executorService.shutdown();

	}

	/**

	 * Callable of finding top 100 elements <br>

	 * The steps are as below:

	 * 1) Quick sort a array

	 * 2) Get reverse 100 elements and put them into a new array

	 * 3) return the new array

	 */

	private class FindTop100 implements Callable<int[]> {

		private int[] array;

		public FindTop100(int[] array) {

			this.array = array;

		}

		@Override

		public int[] call() throws Exception {

			int len = array.length;

			Arrays.sort(array);

			int[] result = new int[TOP_ELEMENTS_NUM];

			int index = 0;

			for (int i = 1; i <= TOP_ELEMENTS_NUM; i++) {

				result[index++] = array[len - i];

			}

			return result;

		}

	}

	/**

	 * Get shuffled int array

	 *

	 * @return array not in order and elements are not duplicate

	 */

	private static int[] getShuffledArray(int len) {

		System.out.println("Start to generate test array... this may take several seconds.");

		List<Integer> list = new ArrayList<Integer>(len);

		for (int i = 0; i < len; i++) {

			list.add(i);

		}

		Collections.shuffle(list);

		int[] ret = new int[len];

		for (int i = 0; i < len; i++) {

			ret[i] = list.get(i);

		}

		return ret;

	}

}

分析看来，这个解的优势在于充分利用了系统资源，使用了分而治之的思想，将时间复杂度平均分配到了每个子线程中，但是代码中大量用到了System.arraycopy进行数组拷贝，占用内存过于多，甚至需要指定JVM的内存-Xmx才可以正常运行起来，平均耗时250ms。

2.3 位图数组思路

这个思路属于比较创新的方式，考虑到优化线索#1提到的无重复元素，那么可以使用位图数组存储元素，一个int占用4个字节，32个bit，也就是说1个int可以表示32个数字的位置。维护一个数组长度/32+1的位图数组x，遍历给定的数组，将数字安插进入这个位图数组x中，例如int[0]=62，那么

index=62/32=1

bit_index=62 mod 32 = 30

那么就置位图数组的x[1]=x[1] | 30，采用“位或”是为了不丢掉以前处理过的数字。

代码实现如下，

import java.util.ArrayList;

import java.util.Collections;

import java.util.List;

/**

 * Implementation of finding top 100 elements out of a huge int array. <br>

 *

 * There is an array of 10000000 different int numbers. Find out its largest 100

 * elements. The implementation should be optimized for executing speed. <br>

 *

 * Note: This is the second version of implementation, the previous one using

 * thread pool provided by JDK concurrent toolkit is not efficient enough, the

 * second version is an enhanced one based on bit map algorithm, which is estimated to

 * have at least a 3 times faster and consume less memory usage.

 *

 * @author zhangxu04

 */

public class FindTopElements2 {

	private static final int ARRAY_LENGTH = 10000000; // big array length

	public static void main(String[] args) {

		FindTopElements2 fte = new FindTopElements2(ARRAY_LENGTH + 1);

		// Get a array which is not in order and elements are not duplicate

		int[] array = getShuffledArray(ARRAY_LENGTH);

		// Find top 100 elements and print them by desc order in the console

		long start = System.currentTimeMillis();

		fte.findTop100(array);

		long end = System.currentTimeMillis();

		System.out.println("Costs " + (end - start) + "ms");

	}

	private final int[] bitmap;

	private final int size;

	public FindTopElements2(final int size) {

		this.size = size;

		int len = ((size % 32) == 0) ? size / 32 : size / 32 + 1;

		this.bitmap = new int[len];

	}

	private static int index(final int number) {

		return number / 32;

	}

	private static int position(final int number) {

		return number % 32;

	}

	private void adjustBitMap(final int index, final int position) {

		int bit = bitmap[index] | (1 << position);

		bitmap[index] = bit;

	}

	public void add(int[] numArr) {

		for (int i = 0; i < numArr.length; i++) {

			add(numArr[i]);

		}

	}

	public void add(int number) {

		adjustBitMap(index(number), position(number));

	}

	public boolean getIndex(final int index) {

		if (index > size) {

			return false;

		}

		int bit = (bitmap[index(index)] >> position(index)) & 0x0001;

		return (bit == 1);

	}

	private void findTop100(int[] arr) {

		System.out.println("Start to compute.");

		add(arr);

		int[] result = new int[100];

		int index = 0;

		for (int i = bitmap.length - 1; i >= 0; i--) {

			for (int j = 31; j >= 0; j--) {

				if (((bitmap[i] >> j) & 0x0001) == 1) {

					if (index == result.length) {

						break;

					}

					result[index++] = ((i) * 32) + j ;

				}

			}

			if (index == result.length) {

				break;

			}

		}

		for (int j = 0; j < result.length; j++) {

			System.out.println(result[j]);

		}

		System.out.println("Finish to output result.");

	}

	/**

	 * Get shuffled int array

	 *

	 * @return array not in order and elements are not duplicate

	 */

	private static int[] getShuffledArray(int len) {

		System.out.println("Start to generate test array... this may take several seconds.");

		List<Integer> list = new ArrayList<Integer>(len);

		for (int i = 0; i < len; i++) {

			list.add(i);

		}

		Collections.shuffle(list);

		int[] ret = new int[len];

		for (int i = 0; i < len; i++) {

			ret[i] = list.get(i);

		}

		return ret;

	}

}

这个算法的时间复杂度是O(N)，非常理想，平均耗时可以减少到50ms作用，性能比排序算法提升了10倍以上，不足在于位图数组的长度取决于给定数组的最大值，如果分布比较平均，并且最大值比较小，那么占用内存空间就可以得到有效的控制。

3 总结

综上给出的题目，可以看出解决一个实际问题，既可以用纯算法的思路来解决，我们甚至可以自己动手实现，例如自己写的堆排序，非常节省空间，如果用JDK自带的快速排序，那么无疑这一点不会好于我们的实现。现今，处理大数据问题一个倾向的思路就是充分利用系统资源，充分发挥多核、大内存计算型服务器的能力，为我们提高效率，多线程是在JAVA中以及有了非常好用的API以及concurrent包下的工具类，能否有效利用这些工具提速我们的程序也很关键。同时，问题总有一些点可以让我们找到最适合的场景来解决，例如位图数组的思路，在性能上达到了最佳，同时多消耗的内存对于现代的服务器来说完全在可控范围内，因此不失为一种创新的好思路。

从无重复大数组找TOP N元素的最优解说起的更多相关文章

疯狂位图之——位图实现12GB无重复大整数集排序
<Programming Pearls>(编程珠玑)第一章讲述了如何用位图排序无重复的数据集,整个思想很简洁,今天实践了下. 一.主要思想位图排序的思想就是在内存中申请一块连续的空间作为 ...
LeetCode 第三题--无重复字符的最长子串
1. 题目 2.题目分析与思路 3.思路 1. 题目输入: "abcabcbb" 输出: 3 解释: 因为无重复字符的最长子串是 "abc",所以其长度为 3 ...
PHP性能优化：in_array和isset 在大数组查询中耗时相差巨大，以及巧妙使用array_flip
今天在PHP业务开发中,发现了一个问题. 两个较大数组(20万+元素),遍历其中一个$a,另一个数组$b用于查找元素. 比如 foreach($a as $val){ if(in_array($xx, ...
疯狂位图之——位图生成12GB无重复随机乱序大整数集
上一篇讲述了用位图实现无重复数据的排序,排序算法一下就写好了,想弄个大点数据测试一下,因为小数据在内存中快排已经很快. 一.生成的数据集要求 1.数据为0--2147483647(2^31-1)范围内 ...
算法练习之合并两个有序链表, 删除排序数组中的重复项,移除元素,实现strStr(),搜索插入位置,无重复字符的最长子串
最近在学习java,但是对于数据操作那部分还是不熟悉因此决定找几个简单的算法写,用php和java分别实现 1.合并两个有序链表将两个有序链表合并为一个新的有序链表并返回.新链表是通过拼接给定的两 ...
【Java】剑指offer(2) 不修改数组找出重复的数字
本文参考自<剑指offer>一书,代码采用Java语言. 更多:<剑指Offer>Java实现合集题目在一个长度为n+1的数组里的所有数字都在1到n的范围内,所以数组中至少 ...
大数据位图法（无重复排序，重复排序，去重复排序，数据压缩）之Java实现
1,位图法介绍位图的基本概念是用一个位(bit)来标记某个数据的存放状态,由于采用了位为单位来存放数据,所以节省了大量的空间.举个具体的例子,在Java中一般一个int数字要占用32位,如果能用一位 ...
一起来刷《剑指Offer》——不修改数组找出重复的数字（思路及Python实现）
数组中重复的数字在上一篇博客中<剑指Offer>-- 题目一:找出数组中重复的数字(Python多种方法实现)中,其实能发现这类题目的关键就是一边遍历数组一边查满足条件的元素. 然后我们 ...
LeetCode3_无重复字符的最长子串(数组&字符串问题)
题目: 给定一个字符串,请你找出其中不含有重复字符的最长子串的长度. 示例 1: 输入: "abcabcbb" 输出: 3 解释: 因为无重复字符的最长子串是 "ab ...

随机推荐

Java 读取指定目录下的文件名和目录名
需求:读取指定目录下的文件名和目录名实现如下: package com.test.common.util; import java.io.File; public class ReadFile { ...
SQL基础概念-指令
1,MySQL:(structured query language)用于访问和处理数据库的标准语言 2,什么是 SQL? SQL 指结构化查询语言 SQL 使我们有能 ...
【整理贴】DBA-常用到的动态视图分析语句
测试应用环境:SQL2008 R2.SQL2012.SQL2014 --语句1:获取前20逻辑读取次数或逻辑写入次数或CPU 时间 SELECT TOP 20 SUBSTRING(qt.TEXT, ( ...
怎么直接让火狐输入json数据，而不是弹出文件保存对话框？
一.问题再现: 我需要浏览器输出的是json数据,但是浏览器弹出的是一个文件保存的对话框,这样的体验有点差.所以想怎么让浏览器直接输出到浏览器的页面上面,并且格式的输出,还可以编辑. 测试数据: ht ...
Heartbeat+LVS构建高可用负载均衡集群
1.heartbeat简介: Heartbeat 项目是 Linux-HA 工程的一个组成部分,它实现了一个高可用集群系统.心跳服务和集群通信是高可用集群的两个关键组件,在 Heartbeat 项目里 ...
OpenStack虚拟机状态
OpenStack创建一个虚拟机,涉及到三种状态,vm_state,task_state和power_state. 先总结几点: 电源状态(power_state):是hypervisor的状态,从计 ...
selenium如何识别验证码
一:前面的文章写了如何右键另存为图片,把验证码存为图片后,接下来就是要做,怎么把图片上的内容获取到,借住tesseract工具 1.下载tesseract:http://sourceforge.net ...
【实践】获取CKEditor的html文本、纯文本、被选中的内容及赋值
<%=Html.TextAreaFor(Model => Model.WORK_INTRODUCTION)%> <script type="text/javasc ...
Golang tool:include spider library,image library and some other db library such as mysql,redis,mogodb,hbase,cassandra
一.Go_tool This is a tool library for Golang.Dont't worry about not understant it! All comment writes ...
第20章 DLL高级技术（2）
20.3 延迟载入DLL 20.3.1延迟载入的目的 (1)如果应用程序使用了多个DLL,那么它的初始化可能比慢,因为加载程序要将所有必需的DLL映射到进程的地址空间.→利用延迟加载可将载入过程延伸到 ...

从无重复大数组找TOP N元素的最优解说起

从无重复大数组找TOP N元素的最优解说起的更多相关文章

随机推荐

热门专题