A Java implementation: extracting the IP that visited Baidu most often on a given day from massive log data

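A few days ago I came across an article by July, 《教你如何迅速秒杀掉：99%的海量数据处理面试题》 (roughly "How to quickly crack 99% of massive-data interview questions"), which discusses the following Baidu interview question:

Given massive log data, extract the IP that visited Baidu the most times on a given day.

July's analysis is as follows.

1. Divide and conquer / hash mapping: the data is too large for the available memory, so the only option is to turn the large file into small files by hashing each record and taking the hash modulo the number of files. In short: break the big problem into small ones, reduce the scale, and solve the pieces one by one.

2. Hash counting: once the large file has been turned into small files, an ordinary hash_map(ip, value) can be used to count frequencies.

3. Heap sort / quicksort: after counting, sort the results (heap sort is one option) to obtain the IP with the most visits.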

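My analysis:

1. Same as July's step 1.

2. Same as July's step 2.

3. No sorting is needed; the most frequent IP can be found while counting. Step 2 already updates the count of every IP, and the largest count is a single value, so each updated count can be compared against the largest count seen so far. When counting finishes, the IP (or IPs) with the largest count is already known.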
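Step 3 of my analysis can be illustrated with a small in-memory sketch. The class name RunningMax and its methods are hypothetical and do not appear in the listings of this post; the sketch only shows the idea of comparing each updated count against the current maximum.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not from the post): counts keys and tracks the current leaders in one pass. */
class RunningMax {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private final List<String> leaders = new ArrayList<String>();
    private int max = 0;

    void add(String ip) {
        Integer old = counts.get(ip);
        int count = (old == null) ? 1 : old + 1;
        counts.put(ip, count);
        if (count > max) {
            // New strict maximum: every previous leader is beaten.
            max = count;
            leaders.clear();
            leaders.add(ip);
        } else if (count == max) {
            // Tie with the current maximum: this IP joins the leaders.
            leaders.add(ip);
        }
    }

    int max() { return max; }

    List<String> leaders() { return leaders; }
}
```

After the last call to add, max() and leaders() hold the answer, so no sort over the counted entries is required.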

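1 Machine configuration:

CPU: I3-2330M 2.20 GHz

MEM: 4 GB (3.16 GB usable)

OS: Windows 7, 32-bit

2 Generating the large file:

2.1 The data set is 100 million IP addresses. Generation rule: every address starts with 10., and the other three parts are random numbers, nominally 0-255 (the code calls random.nextInt(255), which actually yields 0-254).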

	/**
	 * Generates the large file.
	 * @param ipFile       file to write to (appended to if it already exists)
	 * @param numberOfLine number of IP lines to generate
	 */
	public void gernBigFile(File ipFile,long numberOfLine){
		BufferedWriter bw = null;
		long startTime = System.currentTimeMillis();
		try{
			bw = new BufferedWriter(new FileWriter(ipFile,true));
 
			SecureRandom random = new SecureRandom();
			// long counter: an int counter could never reach a numberOfLine above Integer.MAX_VALUE
			for (long i = 0; i < numberOfLine; i++) {
				// nextInt(255) returns 0..254, so at most 255^3 = 16581375 distinct addresses
				bw.write("10."+random.nextInt(255)+"."+random.nextInt(255)+"."+random.nextInt(255)+"\n");
				if((i+1) % 1000 == 0){
					bw.flush();
				}
			}
			bw.flush();
 
			long endTime = System.currentTimeMillis();
			System.err.println(DateUtil.convertMillsToTime(endTime - startTime));
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(bw != null){
					bw.close(); // also closes the underlying FileWriter
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
		}
	}

		/*
		 * 1. Generating 100 million IP lines (at most 16581375 distinct addresses)
		 *    took a little over 3 minutes, under 4.
		 */
		TooMuchIpFile tooMuchIpFile = new TooMuchIpFile();
		File ipFile = new File("e:/ipAddr.txt");
		try {
			ipFile.createNewFile();
		} catch (IOException e) {
			e.printStackTrace();
		}
		tooMuchIpFile.gernBigFile(ipFile, 100000000);

2.2 Result:

Generating 100 million lines of IP addresses took a little over 3 minutes. The file is 1.27 GB (1,370,587,382 bytes).

3 Splitting the large file

Following July's analysis, take the hashCode of each IP modulo 1000 and scatter the IPs into different files accordingly.
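The mapping from an IP line to a file number can be sketched as a small function. Buckets and bucketOf are hypothetical names used only for illustration. One detail is worth knowing: negating a hash code (or calling Math.abs on it) does not make Integer.MIN_VALUE positive, because its negation overflows, so this sketch clears the sign bit instead.

```java
/** Hypothetical helper (not from the post) that maps a line to a file number. */
class Buckets {
    static int bucketOf(String ip, int numberOfFiles) {
        // Clearing the sign bit yields a non-negative value for every hash code,
        // including Integer.MIN_VALUE, for which negation would overflow.
        return (ip.hashCode() & Integer.MAX_VALUE) % numberOfFiles;
    }
}
```

Because String.hashCode depends only on the characters of the string, equal IPs always land in the same file, which is what allows each small file to be counted independently.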

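3.1 Method 1:

For each IP, compute the hash, take it modulo 1000, and immediately write the IP to the file with that number. This would have taken more than 2 hours, which is far too slow, so I stopped the run before it finished.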

	/**
	 * Splits the large file into small files (method 1).
	 * @param ipFile       the large input file
	 * @param numberOfFile number of small files to distribute the lines over
	 */
	public void splitFile(File ipFile,int numberOfFile){
		BufferedReader br = null;
		BufferedWriter bw = null;
		long startTime = System.currentTimeMillis();
		try{
			br = new BufferedReader(new FileReader(ipFile));
			String ipLine = br.readLine();
			while(ipLine != null){
				// mask the sign bit: negating the hash code would fail for Integer.MIN_VALUE
				int hashCode = ipLine.hashCode() & Integer.MAX_VALUE;
				int fileNum = hashCode % numberOfFile;
				File file = new File("e:/tmp/ip/"+ fileNum + ".txt");
				if(!file.exists()){
					file.createNewFile();
				}
				// a new stream is opened in append mode and closed again for every single line
				bw = new BufferedWriter(new FileWriter(file,true));
				bw.write(ipLine + "\n");
				bw.close(); // flushes, then closes the underlying FileWriter
				ipLine = br.readLine();
			}
 
			long endTime = System.currentTimeMillis();
			System.err.println(DateUtil.convertMillsToTime(endTime - startTime));
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(br != null){
					br.close(); // also closes the underlying FileReader
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
			try{
				if(bw != null){
					bw.close(); // no effect if the stream is already closed
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
		}
	}

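3.2 Method 2:

Essentially the same as method 1, except that fewer stream objects are created: a stream is created only once per output file and then reused. The existence of the file is still checked for every line, though. This would have taken more than 1 hour, which is still too slow, so I stopped this run before it finished as well.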

	/**
	 * Splits the large file into small files (method 2).
	 * @param ipFile       the large input file
	 * @param numberOfFile number of small files to distribute the lines over
	 */
	public void splitFile2(File ipFile,int numberOfFile){
		BufferedReader br = null;
		BufferedWriter bw = null;
		long startTime = System.currentTimeMillis();
		try{
			br = new BufferedReader(new FileReader(ipFile));
			String ipLine = br.readLine();
			while(ipLine != null){
				// mask the sign bit: negating the hash code would fail for Integer.MIN_VALUE
				int hashCode = ipLine.hashCode() & Integer.MAX_VALUE;
				int fileNum = hashCode % numberOfFile;
				File file = new File("e:/tmp/ip/"+ fileNum + ".txt");
				if(!file.exists()){
					file.createNewFile();
				}
				// look the stream up in the map rather than relying on file.exists(),
				// so files left over from an earlier run cannot cause a null stream
				bw = bwMap.get(fileNum);
				if(bw == null){
					bw = new BufferedWriter(new FileWriter(file,true));
					bwMap.put(fileNum, bw);
				}
				bw.write(ipLine + "\n");
				bw.flush();
				ipLine = br.readLine();
			}
			for(int fn : bwMap.keySet()){
				bwMap.get(fn).close();
			}
			bwMap.clear();
			long endTime = System.currentTimeMillis();
			System.err.println(DateUtil.convertMillsToTime(endTime - startTime));
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(br != null){
					br.close(); // also closes the underlying FileReader
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
			// close any streams still open if an exception interrupted the loop
			for(BufferedWriter writer : bwMap.values()){
				try{
					writer.close();
				}catch (Exception e) {
					e.printStackTrace();
				}
			}
			bwMap.clear();
		}
	}

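3.3 Method 3:

Essentially the same as method 2, with one further optimization: instead of writing each line to its file as soon as its hash is computed, the lines are first collected in memory and written out together once 1000 of them have accumulated for a file. This took a little over 52 minutes. This run actually completed; I let it run by itself while I went out for lunch.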

	/**
	 * Splits the large file into small files (method 3).
	 * @param ipFile       the large input file
	 * @param numberOfFile number of small files to distribute the lines over
	 */
	public void splitFile3(File ipFile,int numberOfFile){
		BufferedReader br = null;
		long startTime = System.currentTimeMillis();
		try{
			br = new BufferedReader(new FileReader(ipFile));
			String ipLine = br.readLine();
			while(ipLine != null){
				// mask the sign bit: negating the hash code would fail for Integer.MIN_VALUE
				int hashCode = ipLine.hashCode() & Integer.MAX_VALUE;
				int fileNum = hashCode % numberOfFile;
				File file = new File("e:/tmp/ip/"+ fileNum + ".txt");
				if(!file.exists()){
					file.createNewFile();
				}
				List<String> list = dataMap.get(fileNum);
				if(list == null){
					// first line for this file: open its stream and create its in-memory list
					bwMap.put(fileNum, new BufferedWriter(new FileWriter(file,true)));
					list = new LinkedList<String>();
					dataMap.put(fileNum, list);
				}
				// always buffer the line; the earlier version dropped the first line of each file
				list.add(ipLine + "\n");
				if(list.size() % 1000 == 0){
					BufferedWriter writer = bwMap.get(fileNum);
					for(String line : list){
						writer.write(line);
					}
					writer.flush();
					list.clear();
				}
				ipLine = br.readLine();
			}
			// write out whatever is still held in memory
			for(int fn : bwMap.keySet()){
				List<String> list = dataMap.get(fn);
				BufferedWriter writer = bwMap.get(fn);
				for(String line : list){
					writer.write(line);
				}
				list.clear();
				writer.flush();
				writer.close();
			}
			bwMap.clear();
			long endTime = System.currentTimeMillis();
			System.err.println(DateUtil.convertMillsToTime(endTime - startTime));
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(br != null){
					br.close(); // also closes the underlying FileReader
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
			// close any streams still open if an exception interrupted the loop
			for(BufferedWriter writer : bwMap.values()){
				try{
					writer.close();
				}catch (Exception e) {
					e.printStackTrace();
				}
			}
			bwMap.clear();
			dataMap.clear();
		}
	}

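3.4 Method 4:

A further optimization of method 3: the 1000 stream objects are created before the loop rather than inside it. This took about 13 minutes 35 seconds, roughly 4 times faster than method 3, although to me this still feels longer than it should be.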

	/**
	 * Splits the large file into small files (method 4).
	 * The output directory e:/tmp/ip1/ must already exist.
	 * @param ipFile       the large input file
	 * @param numberOfFile number of small files to distribute the lines over
	 */
	public void splitFile4(File ipFile,int numberOfFile){
		BufferedReader br = null;
		long startTime = System.currentTimeMillis();
		try{
			br = new BufferedReader(new FileReader(ipFile));
			String ipLine = br.readLine();
			// create all files and stream objects up front
			for(int i=0;i<numberOfFile;i++){
				File file = new File("e:/tmp/ip1/"+ i + ".txt");
				bwMap.put(i, new BufferedWriter(new FileWriter(file,true)));
				dataMap.put(i, new LinkedList<String>());
			}
			while(ipLine != null){
				// mask the sign bit: negating the hash code would fail for Integer.MIN_VALUE
				int hashCode = ipLine.hashCode() & Integer.MAX_VALUE;
				int fileNum = hashCode % numberOfFile;
				List<String> list = dataMap.get(fileNum);
				list.add(ipLine + "\n");
				if(list.size() % 1000 == 0){
					BufferedWriter writer = bwMap.get(fileNum);
					for(String line : list){
						writer.write(line);
					}
					writer.flush();
					list.clear();
				}
				ipLine = br.readLine();
			}
			// write out whatever is still held in memory
			for(int fn : bwMap.keySet()){
				List<String> list = dataMap.get(fn);
				BufferedWriter writer = bwMap.get(fn);
				for(String line : list){
					writer.write(line);
				}
				list.clear();
				writer.flush();
				writer.close();
			}
			bwMap.clear();
			long endTime = System.currentTimeMillis();
			System.err.println(DateUtil.convertMillsToTime(endTime - startTime));
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(br != null){
					br.close(); // also closes the underlying FileReader
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
			// close any streams still open if an exception interrupted the loop
			for(BufferedWriter writer : bwMap.values()){
				try{
					writer.close();
				}catch (Exception e) {
					e.printStackTrace();
				}
			}
			bwMap.clear();
			dataMap.clear();
		}
	}
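One more variant may be worth trying, although it was not measured for this post. BufferedWriter already keeps an in-memory buffer of its own, so the per-file lists and the explicit flush calls might be unnecessary. The sketch below assumes Java 8 or later; the class BufferedSplitter, its split method and its parameters are hypothetical names, and whether it beats method 4 on the machine above is untested.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch (not measured in the post): one long-lived BufferedWriter per file. */
class BufferedSplitter {
    static void split(Path input, Path outDir, int numberOfFiles) throws IOException {
        Files.createDirectories(outDir);
        BufferedWriter[] writers = new BufferedWriter[numberOfFiles];
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            // Open every output stream once, before reading any input.
            for (int i = 0; i < numberOfFiles; i++) {
                writers[i] = Files.newBufferedWriter(outDir.resolve(i + ".txt"), StandardCharsets.UTF_8);
            }
            String line;
            while ((line = reader.readLine()) != null) {
                int bucket = (line.hashCode() & Integer.MAX_VALUE) % numberOfFiles;
                BufferedWriter writer = writers[bucket];
                // No flush here: the writer's own buffer decides when data reaches the disk.
                writer.write(line);
                writer.write('\n');
            }
        } finally {
            // close() flushes whatever is still buffered.
            for (BufferedWriter writer : writers) {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }
}
```

An array indexed by file number replaces the HashMap lookups, which is possible because the file numbers are exactly 0 to numberOfFiles - 1.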

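3.5 Method 5:

Use multiple threads. I did not manage to get a working optimization out of this, so I only give the idea: read the file of 100 million lines, compute the hash of each IP, take it modulo 1000, and put the IP into the queue for that file. When a queue grows beyond 1000 entries, a worker thread writes its data to the file. In other words, the main thread only does the computation, and other threads do the writing.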
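The idea can be sketched as follows. This is not code that was run for this post; it is a minimal illustration assuming Java 8 or later, and ThreadedSplitter, split, submit and batchSize are hypothetical names. The reading thread hashes each line and collects batches; a single writer thread does all the file writing, so no output stream is ever used by two threads at once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not the author's code): the reader hashes, one worker thread writes. */
class ThreadedSplitter {
    static void split(Path input, Path outDir, int numberOfFiles, int batchSize)
            throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        BufferedWriter[] writers = new BufferedWriter[numberOfFiles];
        List<List<String>> batches = new ArrayList<List<String>>();
        // A single worker runs the tasks in submission order.
        ExecutorService writerThread = Executors.newSingleThreadExecutor();
        try {
            for (int i = 0; i < numberOfFiles; i++) {
                writers[i] = Files.newBufferedWriter(outDir.resolve(i + ".txt"), StandardCharsets.UTF_8);
                batches.add(new ArrayList<String>());
            }
            try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int bucket = (line.hashCode() & Integer.MAX_VALUE) % numberOfFiles;
                    List<String> batch = batches.get(bucket);
                    batch.add(line);
                    if (batch.size() >= batchSize) {
                        submit(writerThread, writers[bucket], batch);
                        // The full list now belongs to the writer thread; start a fresh one.
                        batches.set(bucket, new ArrayList<String>());
                    }
                }
            }
            for (int i = 0; i < numberOfFiles; i++) {
                submit(writerThread, writers[i], batches.get(i)); // leftover lines
            }
        } finally {
            writerThread.shutdown();
            writerThread.awaitTermination(1, TimeUnit.HOURS);
            for (BufferedWriter writer : writers) {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }

    private static void submit(ExecutorService pool, final BufferedWriter writer, final List<String> batch) {
        pool.execute(() -> {
            try {
                for (String s : batch) {
                    writer.write(s);
                    writer.write('\n');
                }
            } catch (IOException e) {
                // Simplification: the failure is only reported by the worker thread.
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

Two simplifications should be kept in mind. The executor's task queue is unbounded, so if reading is much faster than writing, pending batches pile up in memory. Write errors inside the worker are not passed back to the caller. A real implementation would need to address both.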

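3.6 Results:

1. First attempt at splitting the 100-million-line file: far too slow. After almost an hour only about 3 million lines had been split, so the full run would have taken more than 2 hours.

2. Second attempt, after optimization: faster than the first, but still very slow; more than 1 hour.

3. Third attempt, after optimization: faster than the second, but still very slow; 52 min 3.6 s.

4. Fourth attempt, after optimization: 13 min 35.1 s.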
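The listings print their timings through DateUtil.convertMillsToTime, a helper of mine that is not shown in this post. For anyone who wants to run the code, a hypothetical stand-in could look like the sketch below; DurationFormat and format are made-up names, and the output format is an assumption. Doing the split into minutes, seconds and milliseconds with integer arithmetic avoids floating-point noise in the printed seconds.

```java
import java.util.Locale;

/** Hypothetical stand-in for the DateUtil helper, which the post does not show. */
class DurationFormat {
    static String format(long millis) {
        long minutes = millis / 60000;            // whole minutes
        long seconds = (millis % 60000) / 1000;   // whole seconds within the minute
        long ms = millis % 1000;                  // remaining milliseconds
        return String.format(Locale.ROOT, "%d min %d.%03d s", minutes, seconds, ms);
    }
}
```

For example, DurationFormat.format(815104) returns "13 min 35.104 s".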

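4 Counting

Find the most frequent IP in each file (there may be more than one):

The approach is to count the occurrences of each IP and, at the same time, keep track of the IP with the largest count.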

	/**
	 * Counts the IPs in one small file and updates the most frequent IP(s) seen so far.
	 * @param ipFile one of the small files produced by the split
	 */
	public void read(File ipFile){
		BufferedReader br = null;
		long startTime = System.currentTimeMillis();
		try{
			br = new BufferedReader(new FileReader(ipFile));
			String ipLine = br.readLine();
			while(ipLine != null){
				ipLine = ipLine.trim();
				Integer count = ipNumMap.get(ipLine);
				if(count == null){
					count = 0;
				}
				count ++;
				ipNumMap.put(ipLine, count);
 
				if(count >= ipMaxNum){
					if(count > ipMaxNum){
						// new strict maximum: the previous leaders are beaten
						keyList.clear();
					}
					// on a tie the IP joins the current leaders
					keyList.add(ipLine);
					ipMaxNum = count;
				}
				ipLine = br.readLine();
			}
			long endTime = System.currentTimeMillis();
			System.err.println(ipFile.getName()+":"+DateUtil.convertMillsToTime(endTime - startTime));
			totalTime += (endTime - startTime);
		}catch (Exception e) {
			e.printStackTrace();
		}finally{
			try{
				if(br != null){
					br.close(); // also closes the underlying FileReader
				}
			}catch (Exception e) {
				e.printStackTrace();
			}
		}
	}

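4.1 Results:

1. Finding the most frequent IP across the 1000 files: 10.164.143.57 with 24 occurrences, 3 min 18.7 s

2. Finding the most frequent IP across the 1000 files: 10.164.143.57 with 24 occurrences, 3 min 27.4 s

3. Finding the most frequent IP across the 1000 files: 10.164.143.57 with 24 occurrences, 2 min 42.8 s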

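5 Shared fields used by the code above

// imports needed by the listings in this post
import java.io.*;
import java.security.SecureRandom;
import java.util.*;

	public final Map<Integer,BufferedWriter> bwMap = new HashMap<Integer,BufferedWriter>();// stream object of each output file
	public final Map<Integer,List<String>> dataMap = new HashMap<Integer,List<String>>();// lines held in memory while splitting
	private Map<String,Integer> ipNumMap = new HashMap<String, Integer>();// occurrences of each IP within the current file
	private List<String> keyList = new LinkedList<String>();// the IP(s) with the most occurrences
	private int ipMaxNum = 0;// the largest number of occurrences
	private long totalTime = 0;// total time spent counting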

6 Main

	public static void main(String[] args) {
		/*
		 * 1. Generating 100 million IP lines (at most 16581375 distinct addresses)
		 *    took a little over 3 minutes, under 4.
		 */
		/*TooMuchIpFile tooMuchIpFile = new TooMuchIpFile();
		File ipFile = new File("e:/ipAddr.txt");
		try {
			ipFile.createNewFile();
		} catch (IOException e) {
			e.printStackTrace();
		}
		tooMuchIpFile.gernBigFile(ipFile, 100000000);*/
 
//		System.err.println("128.128.80.226".hashCode()%1000);
//		System.err.println("128.128.80.227".hashCode());
//		System.err.println("10.128.80.227".hashCode());
//		System.err.println("10.0.80.227".hashCode());
 
		/*
		 * 1. First split of the 100-million-line file: far too slow; after almost an hour
		 *    only about 3 million lines had been split; more than 2 hours overall.
		 * 2. Second split, after optimization: faster than the first, but still more than 1 hour.
		 * 3. Third split, after optimization: faster than the second, but still 52 min 3.6 s.
		 * 4. Fourth split, after optimization: 13 min 35.1 s.
		 */
		TooMuchIpFile tooMuchIpFile = new TooMuchIpFile();
		File ipFile = new File("e:/ipAddr.txt");
		tooMuchIpFile.splitFile4(ipFile, 1000);
 
		/*
		 * 1. Most frequent IP across the 1000 files: 10.164.143.57:24, 3 min 18.7 s
		 * 2. Most frequent IP across the 1000 files: 10.164.143.57:24, 3 min 27.4 s
		 * 3. Most frequent IP across the 1000 files: 10.164.143.57:24, 2 min 42.8 s
		 */
//		TooMuchIpFile tooMuchIpFile = new TooMuchIpFile();
//		File ipFiles = new File("e:/tmp/ip1/");
//		for (File ipFile : ipFiles.listFiles()) {
//			tooMuchIpFile.read(ipFile);
//			tooMuchIpFile.ipNumMap.clear();
//		}
//		System.err.println("======================Most frequent IP(s)==================");
//		for(String key: tooMuchIpFile.keyList){
//			System.err.println(key + ":" + tooMuchIpFile.ipMaxNum);
//		}
//		System.err.println(DateUtil.convertMillsToTime(tooMuchIpFile.totalTime));
	}
