Cleaning Network Packet Data with MapReduce

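The cleaned data can be loaded straight into Hive or fed to further MapReduce programs to compute statistics about network traffic. What is implemented so far is fairly simple: counting HTTP GET requests.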

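First MapReduce job: extract each packet's timestamp and hexadecimal header dump and put them on one line. (This needs special handling of multi-line records within MapReduce's key-value model, which is a point worth noting.)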

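I ran into two main problems:

A packet in the capture consists of a timestamp, a brief header summary, and a detailed header dump. The goal was to put each packet's timestamp and its detailed hexadecimal dump (which is spread over many lines) onto a single line, in order. In plain Java, reading the file line by line, this is easy to implement.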

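Given how MapReduce processes key-value pairs, I first thought of two approaches:

(1) Let the timestamp line define the key, and give every line of the same packet that same key.

However, map handles only one line at a time, and reduce only groups lines that share a key. Between map and reduce there is also a shuffle and sort phase. Hadoop sorts the map output by key only and makes no promise about the order of the values within one key, so the values for a key arrive out of order. (My guess is that the shuffle is the cause; it could also be that machines closer to the reducer finish first and send their output straight over, first come, first served.)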

(2) Give every line a strictly increasing key.

Then no two lines share a key, so they can never be merged onto one line.

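The final solution:

(3) Let the timestamp line define the key and give every line of the same packet that key, but also prefix each hexadecimal line with an increasing id before merging everything onto one line. The fragments arrive out of order, but since each one carries its own id, sorting them afterwards restores the order. Neat!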
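The idea in (3) can be tried without Hadoop. The following is a minimal sketch in plain Java, not the code of the actual jobs: the class `TagAndResort` and its methods are hypothetical names, and the hex values are made up for illustration. It tags each fragment with an id, accepts the fragments in arbitrary order, and rebuilds the line by sorting on the numeric id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java sketch (no Hadoop): tag hex lines with an id, then restore their order. */
final class TagAndResort {

    /** Hypothetical helper: prefixes a hex line with its sequence id. */
    static String tag(int id, String hexLine) {
        return "id:" + id + " " + hexLine;
    }

    /** Rebuilds one line from tagged fragments that may arrive in any order. */
    static String reassemble(List<String> taggedFragments) {
        List<String> sorted = new ArrayList<>(taggedFragments);
        // Compare ids as numbers; a plain string sort would put id:10 before id:2.
        sorted.sort(Comparator.comparingInt(TagAndResort::idOf));
        StringBuilder out = new StringBuilder();
        for (String fragment : sorted) {
            // Drop the "id:N " prefix and keep only the hex groups.
            out.append(fragment.substring(fragment.indexOf(' ') + 1)).append(' ');
        }
        return out.toString().trim();
    }

    /** Parses the number between "id:" and the first space. */
    private static int idOf(String fragment) {
        return Integer.parseInt(fragment.substring(3, fragment.indexOf(' ')));
    }

    public static void main(String[] args) {
        // Made-up fragments, deliberately out of order.
        List<String> arrived = Arrays.asList(
                tag(10, "c0a8 0101"),
                tag(2, "4500 003c"),
                tag(3, "1c46 4000"));
        System.out.println(reassemble(arrived));
    }
}
```

Sorting on the parsed number rather than on the text matters as soon as the ids reach two digits.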

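Second MapReduce job: sort the hexadecimal fragments. This complements the first job. With that, the cleaning is done, and the hexadecimal value at any position can be examined to analyze the data.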

Third MapReduce job: count the HTTP GET requests that were sent.

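// Job 1. The enclosing driver class is not shown; these imports go at the top of its file.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Fields of the enclosing class. The counters are only meaningful when a single
// map task reads the capture file from top to bottom.
static int id = 1;     // packet number, incremented on every timestamp line
static int hexId = 1;  // running number of hex lines, used later to restore their order

public static class TokenizerMapper
        extends Mapper<Object, Text, IntWritable, Text> {

    // Timestamp such as 11:08:56.149361 (hours 00-23, literal dot before the microseconds)
    private static final Pattern PATTERN_TIME =
            Pattern.compile("(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\\.[0-9]{6}");
    // One line of the hex dump: a space, then groups of four hex digits, each followed by a space
    private static final Pattern PATTERN_HEX = Pattern.compile(" ([0-9A-Fa-f]{4} )+");

    private final IntWritable outKey = new IntWritable();
    private final Text outValue = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();

        // Match the timestamp: it starts a new packet
        Matcher matchTime = PATTERN_TIME.matcher(line);
        while (matchTime.find()) {
            id = id + 1;
            outKey.set(id);
            outValue.set("time: " + matchTime.group() + " ");
            context.write(outKey, outValue);
        }

        // Match the hex dump: emitted under the current packet number, tagged with its own id
        Matcher matchHex = PATTERN_HEX.matcher(line);
        while (matchHex.find()) {
            hexId = hexId + 1;
            // The match begins with a space, so the value reads "id:N   4500 0034 ..."
            // with three spaces, which is what the second job's pattern expects.
            outValue.set("id:" + hexId + "  " + matchHex.group());
            outKey.set(id);
            context.write(outKey, outValue);
        }
    }
}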

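public static class IntSumReducer
        extends Reducer<IntWritable, Text, IntWritable, Text> {

    private final Text result = new Text();

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all fragments of one packet onto a single line. Their order is not
        // guaranteed here; the ids added by the mapper let the second job sort them.
        StringBuilder sum = new StringBuilder();
        for (Text val : values) {
            sum.append(val.toString());
        }
        result.set(sum.toString());
        context.write(key, result);
    }
}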
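The timestamp pattern is worth checking in isolation, because a character class such as `[0-2][0-4]` for the hour accepts only hours whose second digit is 0 to 4 and silently skips a packet captured at, say, 19:08:56. Below is a minimal sketch of a stricter pattern; the class name `TimestampPattern` and the sample capture line are assumptions made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: a timestamp pattern that accepts every hour from 00 to 23. */
final class TimestampPattern {

    // Hours 00-23, minutes and seconds 00-59, a literal dot, six microsecond digits.
    static final Pattern TIME = Pattern.compile(
            "(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\\.[0-9]{6}");

    /** Returns the first timestamp found in the line, or null if there is none. */
    static String firstTimestamp(String line) {
        Matcher m = TIME.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical capture line, not taken from real data.
        System.out.println(firstTimestamp("19:08:56.149361 IP 10.0.0.1.80 > 10.0.0.2.5000"));
    }
}
```

Escaping the dot also keeps the pattern from accepting an arbitrary character between the seconds and the microseconds.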

// Job 2. The enclosing driver class is not shown; these imports go at the top of its file.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, Text> {

    // "time: " followed by a timestamp such as 11:08:56.149361
    private static final Pattern PATTERN_TIME =
            Pattern.compile("time: (?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\\.[0-9]{6}");
    // One tagged fragment: group 1 is the line id, group 2 the hex groups of that line
    private static final Pattern PATTERN_FRAGMENT =
            Pattern.compile("id:([0-9]+)   ((?:[0-9A-Fa-f]{4} )+)");

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();

        // Match the timestamp; it becomes the output key
        Matcher matchTime = PATTERN_TIME.matcher(line);
        while (matchTime.find()) {
            String temptime = matchTime.group();
            // Strips the "time: " prefix and also the last microsecond digit, leaving
            // 14 characters. The third job's fixed offsets rely on this key length.
            outKey.set(temptime.substring(6, temptime.length() - 1));
        }

        // Sort the hex fragments. Bar is a helper class not shown in this post; it
        // must implement Comparable<Bar> and order fragments by their id.
        List<Bar> list = new ArrayList<Bar>();
        Matcher matchHex = PATTERN_FRAGMENT.matcher(line);
        while (matchHex.find()) {
            Bar bar = new Bar();
            bar.setId(matchHex.group(1));        // sequence number of this hex line
            bar.setHexValue(matchHex.group(2));  // the hex groups of this line
            list.add(bar);
        }
        Collections.sort(list);

        // Join the sorted fragments into one line
        StringBuilder buffer = new StringBuilder();
        for (Bar bar : list) {
            buffer.append(bar.getHexValue());
        }
        outValue.set(buffer.toString());
        context.write(outKey, outValue);
    }
}

public static class IntSumReducer
        extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Pass every sorted packet line through unchanged
        for (Text val : values) {
            context.write(key, val);
        }
    }
}
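The second job's mapper sorts a list of `Bar` objects, but the post does not include the `Bar` class. The following is a hypothetical stand-in, assuming `Bar` needs only the two setters and the getter the mapper calls and that it should order fragments by numeric id. The helpers `of` and `joinSorted` exist only for the demo and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the Bar helper that the post uses but does not show. */
final class Bar implements Comparable<Bar> {
    private String id;        // sequence id, stored as text the way the mapper passes it
    private String hexValue;  // one line of hex groups, e.g. "4500 003c "

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getHexValue() { return hexValue; }
    public void setHexValue(String hexValue) { this.hexValue = hexValue; }

    @Override
    public int compareTo(Bar other) {
        // Compare as numbers: as strings, "10" would sort before "9".
        return Long.compare(Long.parseLong(id), Long.parseLong(other.id));
    }

    /** Demo helper: builds a Bar in one call. */
    static Bar of(String id, String hexValue) {
        Bar bar = new Bar();
        bar.setId(id);
        bar.setHexValue(hexValue);
        return bar;
    }

    /** Demo helper: sorts a copy of the list by id and concatenates the hex values. */
    static String joinSorted(List<Bar> bars) {
        List<Bar> copy = new ArrayList<>(bars);
        Collections.sort(copy);
        StringBuilder out = new StringBuilder();
        for (Bar bar : copy) {
            out.append(bar.getHexValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Bar> bars = new ArrayList<>();
        bars.add(of("10", "0a63 "));
        bars.add(of("9", "4500 "));
        bars.add(of("100", "ac10 "));
        System.out.println(joinSorted(bars));
    }
}
```

If the real `Bar` compares ids as strings, fragments with ids of different lengths would come out in the wrong order, so the numeric comparison is the part worth verifying.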

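// Job 3. The enclosing driver class is not shown; these imports go at the top of its file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text("sumGet");

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        int timelen = 15;                    // 14-character time key plus the tab separator
        int start = timelen + 20 * 5;        // skip 20 hex groups of 5 characters ("xxxx ")
        int end = timelen + 21 * 5 - 1;      // end of the four hex digits of group 20
        String strline = value.toString();
        if (strline.length() >= end) {
            // Group 20 holds bytes 40-41, which are the first payload bytes only when
            // the dump starts at the IP header and the IP and TCP headers are 20 bytes
            // each (no options).
            String getPos = strline.substring(start, end);
            if (getPos.equals("4745")) {     // 0x47 0x45 is "GE", the start of "GET"
                context.write(word, one);
            }
        }
    }
}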

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up the 1s emitted for each matching packet
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
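The third job looks for `4745` at a fixed position, which corresponds to a 20-byte IP header followed by a 20-byte TCP header. Packets that carry TCP options have a longer header, so their payload starts later. The following sketch reads the header lengths from the dump instead. It is not the author's method: the class `HttpGetCheck` is a hypothetical name, the sample bytes in `main` are made up, and it assumes the hex groups begin at the IPv4 header with no link-layer header in front.

```java
/** Sketch: find the TCP payload from the header length fields and test for "GET ". */
final class HttpGetCheck {

    /**
     * @param hexGroups space-separated groups of four hex digits, assumed to start
     *                  at the first byte of an IPv4 header
     * @return true if the packet is TCP and its payload begins with "GET "
     */
    static boolean startsWithGet(String hexGroups) {
        String hex = hexGroups.replace(" ", "");
        if (hex.length() < 40) {
            return false;                       // shorter than a minimal IPv4 header
        }
        // IHL is the low nibble of byte 0 and counts 32-bit words.
        int ipHeaderBytes = Character.digit(hex.charAt(1), 16) * 4;
        if (ipHeaderBytes < 20) {
            return false;                       // malformed or not a hex digit
        }
        // Byte 9 of the IPv4 header is the protocol number; 0x06 means TCP.
        if (!hex.regionMatches(18, "06", 0, 2)) {
            return false;
        }
        // The TCP data offset is the high nibble of TCP byte 12, also in 32-bit words.
        int dataOffsetPos = (ipHeaderBytes + 12) * 2;
        if (hex.length() <= dataOffsetPos) {
            return false;
        }
        int tcpHeaderBytes = Character.digit(hex.charAt(dataOffsetPos), 16) * 4;
        if (tcpHeaderBytes < 20) {
            return false;
        }
        int payloadPos = (ipHeaderBytes + tcpHeaderBytes) * 2;
        // "GET " is 47 45 54 20 in ASCII; regionMatches returns false if out of range.
        return hex.regionMatches(true, payloadPos, "47455420", 0, 8);
    }

    public static void main(String[] args) {
        // Made-up packet: 20-byte IP header, 32-byte TCP header (with options), "GET / ".
        String ip = "4500 003c 1c46 4000 4006 b1e6 c0a8 0001 c0a8 0002 ";
        String tcp = "c350 0050 0000 0001 0000 0000 8018 7210 0000 0000 "
                + "0101 080a 0000 0001 0000 0002 ";
        System.out.println(startsWithGet(ip + tcp + "4745 5420 2f20"));
    }
}
```

Matching all four bytes of "GET " rather than only "GE" also reduces the chance of counting unrelated payloads.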
