Configuring Nginx

Install with yum install nginx (on both host99 and host101).
Start the service with service nginx start.
Check the processes with ps -ef | grep nginx:

ps -ef |grep nginx

root     28230     1  0 14:54 ?        00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf

nginx    28231 28230  0 14:54 ?        00:00:00 nginx: worker process

nginx    28232 28230  0 14:54 ?        00:00:00 nginx: worker process

nginx    28234 28230  0 14:54 ?        00:00:00 nginx: worker process

nginx    28235 28230  0 14:54 ?        00:00:00 nginx: worker process

....

The page can also be reached from a browser on the local machine via ip + port (port 80 by default).

HTML

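The page setup is very simple: I put an index.html file under /root/html.
First test that it works: enter the IP address in a browser and the page should appear.
The next step is to add an input box to simulate search traffic.
index.html is extremely simple (front-end is not my strong suit):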

<!DOCTYPE html>

<html>

    <body>

        <form action="/result.html" method="GET">

            Please Input:<br>

            <input type="text" name="Input" value="Mickey">

            <br>

            <br>

            <input type="submit" value="Submit">

        </form> 

    </body>

</html>

The target page, result.html, just contains an arbitrary sentence.

Logs

The logs are written to access.log in the /var/log/nginx directory.
```
tail -f access.log
```
Enter something in the input box and submit it, and a new entry appears in the log. For example, after I submitted "test", the log showed:

10.109.255.90 - - [11/May/2017:14:24:16 +0800] "GET /action_page.php?Input=test HTTP/1.1" 404 169 "http://10.3.242.101/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"
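Since every later step works on these access-log lines, it helps to see how one line breaks down into fields. The following is a minimal sketch, assuming nginx is using its default "combined" log format (which the line above appears to match); LOG_PATTERN and parse_access_line are names made up for this illustration.

```python
import re

# Assumed layout (nginx "combined" format):
# addr - user [time] "request" status bytes "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_access_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = ('10.109.255.90 - - [11/May/2017:14:24:16 +0800] '
          '"GET /action_page.php?Input=test HTTP/1.1" 404 169 '
          '"http://10.3.242.101/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
          'AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"')
fields = parse_access_line(sample)
print(fields["request"], fields["status"])
```

The search term lives inside the request field, which is the part the later analysis cares about.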

Configuring Flume

Installation (on host101):

wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz

tar -xvf apache-flume-1.7.0-bin.tar.gz

Testing:

Create a test.conf file in the conf folder with the following content:

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = localhost

a1.sources.r1.port = 44444

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Then run:

bin/flume-ng agent -n a1 -c conf -f conf/test.conf

Then, in the logs folder, run the following to follow the log:
```
tail -f flume.log
```
Then run telnet localhost 44444 and type some data. The log shows:

11 May 2017 15:26:55,372 INFO  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.LoggerSink.process:95)  - Event: { headers:{} body: 64 61 6B 6A 64 63 0D                            dakjdc. }

If so, everything is working.

Nginx (web server) side

On the Nginx side, configure a source-side Flume agent to collect the access-log stream:
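The logger sink prints the event body as hex bytes followed by a printable rendering. A small sketch of decoding that hex dump back into text (decode_logger_body is a hypothetical helper, not part of Flume):

```python
def decode_logger_body(hex_dump):
    """Turn the space-separated hex bytes printed by the logger sink back into text."""
    return bytes.fromhex(hex_dump).decode("utf-8", errors="replace")


body = decode_logger_body("64 61 6B 6A 64 63 0D")
print(repr(body))
```

The trailing 0D is a carriage return, presumably left over from the CR LF line ending that telnet sends.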

WebAccLog.sources = NginxAccess

WebAccLog.sinks = AvroSink

WebAccLog.channels = MemChannel

WebAccLog.sources.NginxAccess.type = exec

WebAccLog.sources.NginxAccess.command = tail -f /var/log/nginx/access.log

WebAccLog.sources.NginxAccess.batchSize = 10

WebAccLog.sources.NginxAccess.interceptors = itime

WebAccLog.sources.NginxAccess.interceptors.itime.type = timestamp

WebAccLog.sinks.AvroSink.type = avro

WebAccLog.sinks.AvroSink.hostname = 10.3.242.99

WebAccLog.sinks.AvroSink.port = 4545

WebAccLog.channels.MemChannel.type = memory

WebAccLog.sinks.AvroSink.channel = MemChannel

WebAccLog.sources.NginxAccess.channels = MemChannel

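Note that an interceptor is configured on the source here. In Flume, events collected by a source are first handed to the ChannelProcessor, where they are filtered and processed by the interceptors and then routed to a channel by the ChannelSelector.
A timestamp interceptor is configured here; it is used later when setting hdfs.path (the storage path is derived from the timestamp).
About interceptors: Flume ships with many built-in interceptors and also implements InterceptorChain for passing an event through a chain of them.
- HostInterceptor: adds the local host name or IP to the header of every intercepted event
- TimestampInterceptor: adds the time at which it processed the event (in millis) to the header of every intercepted event.
When I first wrote this I left the s off sources, and only found the problem after checking the log. Honestly, this conf file makes my eyes glaze over; there is a lot to edit.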
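To make the interceptor idea concrete, here is a conceptual model in Python. It is not Flume's code: events are modelled as plain dicts, and timestamp_interceptor, host_interceptor and run_chain are hypothetical names. The header keys "timestamp" and "host" are the ones I believe the built-in interceptors use by default.

```python
import socket
import time


def timestamp_interceptor(event):
    """Add the processing time in milliseconds under the 'timestamp' header."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event


def host_interceptor(event):
    """Add the local host name under the 'host' header."""
    event["headers"]["host"] = socket.gethostname()
    return event


def run_chain(interceptors, event):
    """Pass the event through each interceptor in order; None means it was dropped."""
    for interceptor in interceptors:
        event = interceptor(event)
        if event is None:
            return None
    return event


event = {"headers": {}, "body": b"GET /result.html?Input=test"}
event = run_chain([timestamp_interceptor, host_interceptor], event)
print(sorted(event["headers"]))
```

The body is left untouched; only the headers grow, and those headers are what the HDFS sink can later use to build its path.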

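Hadoop/Spark cluster side

On the Hadoop/Spark cluster side, configure a receiver-side Flume agent to receive the data.
First use a logger sink to test that everything works:
- Start Flume on the cluster side, then run the following on the web server side: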
- ```
bin/flume-ng avro-client -c ./conf -H 10.3.242.99 -p 4545 -F /var/log/nginx/access.log
```
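- The cluster-side Flume then prints a batch of events in its log, which shows the link works.
Next, connect the two sides.
Below is an example with an HDFS sink; other requirements can be met by changing the sink section.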

# clusterLogAgent

# Naming the components of the current agent.

clusterLogAgent.sources = AvroSource

clusterLogAgent.sinks = HDFS

clusterLogAgent.channels = MemChannel

# Source configuration

clusterLogAgent.sources.AvroSource.type = avro

# hostname or IP address to listen on

clusterLogAgent.sources.AvroSource.bind = 0.0.0.0

clusterLogAgent.sources.AvroSource.port = 4545

# sink configuration(write to HDFS)

clusterLogAgent.sinks.HDFS.type = hdfs

clusterLogAgent.sinks.HDFS.hdfs.path = /logFlume/nginx/accesslog

# File format: currently SequenceFile, DataStream or CompressedStream

clusterLogAgent.sinks.HDFS.hdfs.fileType = DataStream

# Number of events written to file before it rolled (0 = never roll based on number of events)

clusterLogAgent.sinks.HDFS.hdfs.rollCount = 0

clusterLogAgent.channels.MemChannel.type = memory

clusterLogAgent.sources.AvroSource.channels = MemChannel

clusterLogAgent.sinks.HDFS.channel = MemChannel

Startup

First start the Flume service on the cluster (note that the startup order matters here):

./bin/flume-ng agent -n clusterLogAgent -c conf -f conf/flume-hdfsSink.conf

Then start the Flume service on the Nginx side:

bin/flume-ng agent -n WebAccLog -c conf -f conf/flume-avro.conf

Check HDFS to verify that the streamed access-log events were written successfully:

[root@host99 /home/hhh/apache-flume-1.7.0-bin/conf]$hadoop fs -ls /logFlume/nginx/accesslog

17/05/11 17:25:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 3 items

-rw-r--r--   2 root supergroup       1146 2017-05-11 17:15 /logFlume/nginx/accesslog/FlumeData.1494494113448

-rw-r--r--   2 root supergroup       1190 2017-05-11 17:15 /logFlume/nginx/accesslog/FlumeData.1494494113449

-rw-r--r--   2 root supergroup        952 2017-05-11 17:16 /logFlume/nginx/accesslog/FlumeData.1494494113450

[root@host101 /home/hhh/apache-flume-1.7.0-bin/conf]$hadoop fs -cat /logFlume/nginx/accesslog/FlumeData.1494494113448

17/05/11 17:19:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

10.109.255.90 - - [11/May/2017:16:56:35 +0800] "GET /result.html?Input=ttt HTTP/1.1" 200 23 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"

10.109.255.90 - - [11/May/2017:16:56:45 +0800] "GET /result.html?Input=ttt HTTP/1.1" 200 23 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"

10.109.255.90 - - [11/May/2017:16:57:04 +0800] "GET /result.html?Input=hhhhhh HTTP/1.1" 200 23 "http://10.3.242.101/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"

10.109.255.90 - - [11/May/2017:16:57:07 +0800] "GET /result.html?Input=hhhhhh HTTP/1.1" 200 23 "http://10.3.242.101/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"

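Tuning the Flume conf

Settings that can be tuned:

Avro Sink:
- batch-size: number of events sent per batch; default 100
- compression-type: compression type; no compression by default.
- compression-level: compression level
- maxIoWorkers: maximum number of I/O worker threads; default 2 * the number of available processors on the machine

Avro Source:
- threads: maximum number of worker threads

HDFS sink: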

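The HDFS sink can create text and sequence files, and supports compression.
Files can be rolled, based on elapsed time, data size, or number of events.
Data can be bucketed/partitioned by event attributes (such as timestamp or machine).
The HDFS path can contain formatting escape sequences, which the HDFS sink replaces to generate the directory/file name in which events are stored, as follows: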

Alias	Description
%{host}	Substitute value of event header named “host”. Arbitrary header names are supported.
%t	Unix time in milliseconds
%a	locale’s short weekday name (Mon, Tue, ...)
%A	locale’s full weekday name (Monday, Tuesday, ...)
%b	locale’s short month name (Jan, Feb, ...)
%B	locale’s long month name (January, February, ...)
%c	locale’s date and time (Thu Mar 3 23:05:25 2005)
%d	day of month (01)
%e	day of month without padding (1)
%D	date; same as %m/%d/%y
%H	hour (00..23)
%I	hour (01..12)
%j	day of year (001..366)
%k	hour ( 0..23)
%m	month (01..12)
%n	month without padding (1..12)
%M	minute (00..59)
%p	locale’s equivalent of am or pm
%s	seconds since 1970-01-01 00:00:00 UTC
%S	second (00..59)
%y	last two digits of year (00..99)
%Y	year (2010)
%z	+hhmm numeric timezone (for example, -0400)
%[localhost]	Substitute the hostname of the host where the agent is running
%[IP]	Substitute the IP address of the host where the agent is running
%[FQDN]	Substitute the canonical hostname of the host where the agent is running

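- Some settings that may be worth tuning:
  - hdfs.rollSize: file size that triggers a roll, in bytes; default 1024
  - hdfs.rollCount: number of events written to the file before it is rolled; default 10. Set to 0 to never roll based on event count.
  - hdfs.rollInterval: number of seconds to wait before rolling the current file; default 30. Set to 0 to never roll based on time interval.
  - hdfs.batchSize: number of events written to the file before it is flushed to HDFS
  - hdfs.threadsPoolSize: number of threads per HDFS sink for HDFS IO operations; default 10

Optimizations already applied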

HDFS directory layout: one directory per date
```
# flume-hdfs.sink
clusterLogAgent.sinks.HDFS.hdfs.path = /logFlume/nginx/%y.%m.%d/
```
With this setting, files land in one directory per day.
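As a rough illustration of how such a path is resolved, the sketch below expands the date escapes from an event's timestamp header. It only covers escapes that Python's strftime happens to share with Flume (such as %y, %m, %d, %H); Flume-specific ones like %t, %n, %{host} or %[localhost] are not handled. expand_hdfs_path, the explicit +0800 time zone and the example timestamp value are assumptions for this illustration.

```python
from datetime import datetime, timedelta, timezone


def expand_hdfs_path(pattern, headers, tz):
    """Fill date escapes such as %y, %m, %d from the event's 'timestamp' header."""
    millis = int(headers["timestamp"])
    event_time = datetime.fromtimestamp(millis / 1000, tz=tz)
    return event_time.strftime(pattern)


cst = timezone(timedelta(hours=8))  # the +0800 offset seen in the access log
path = expand_hdfs_path("/logFlume/nginx/%y.%m.%d/",
                        {"timestamp": "1494494113448"}, cst)
print(path)  # /logFlume/nginx/17.05.11/
```

This is also why the timestamp interceptor on the source side matters: without a timestamp header there is nothing to resolve the date escapes against.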

Configure the roll size to avoid producing lots of small files:

#clusterLogAgent.sinks.HDFS.hdfs.rollSize =64*1024*1024

clusterLogAgent.sinks.HDFS.hdfs.rollSize = 67108864

clusterLogAgent.sinks.HDFS.hdfs.rollCount = 0

clusterLogAgent.sinks.HDFS.hdfs.rollInterval = 0
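The three roll settings act as independent triggers, and 0 switches a trigger off. A conceptual sketch of that decision, with defaults set to the values configured above (should_roll is a hypothetical function, and the exact comparisons Flume uses internally are an assumption):

```python
def should_roll(bytes_written, events_written, seconds_open,
                roll_size=67108864, roll_count=0, roll_interval=0):
    """Return True when any enabled trigger is reached; 0 disables a trigger."""
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    if roll_interval > 0 and seconds_open >= roll_interval:
        return True
    return False


# 64 MB expressed in bytes, as written in the config.
assert 64 * 1024 * 1024 == 67108864

print(should_roll(1024, 500, 3600))      # small file, long open: no roll
print(should_roll(67108864, 500, 10))    # size limit reached: roll
```

With rollCount and rollInterval both at 0, only the file size can trigger a roll.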

TBD

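Compression
Number of threads and workers
...
Flume can do some filtering internally. In the current scenario that would mean using regular expressions; the main concern is how much this would hurt throughput, since Flume is not well suited to complex filtering.

Log Analysis

Search-term analysis: N-gram

Extracting the search term: regular expression.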

10.109.255.90 - - [12/May/2017:14:31:46 +0800] "GET /result.html?Input=Mickey HTTP/1.1" 304 0 "http://10.3.242.101/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8"

10.30.146.74 - - [12/May/2017:14:33:55 +0800] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"

As you can see, the part to extract is "?Input=Mickey". The regex "\?Input=[^\s]*\s" matches this segment.
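A quick way to try the regex outside MapReduce is a few lines of Python. This is only a sketch: the capture group around the value is my addition so the term itself can be pulled out, and INPUT_PATTERN and extract_search_term are made-up names. The two sample lines are shortened versions of the log entries above.

```python
import re

# Compiled once at import time rather than on every call.
INPUT_PATTERN = re.compile(r"\?Input=([^\s]*)\s")


def extract_search_term(log_line):
    """Return the raw (still '+'-separated) search term, or None if there is none."""
    match = INPUT_PATTERN.search(log_line)
    return match.group(1) if match else None


with_term = ('10.109.255.90 - - [12/May/2017:14:31:46 +0800] '
             '"GET /result.html?Input=Mickey HTTP/1.1" 304 0')
without_term = ('10.30.146.74 - - [12/May/2017:14:33:55 +0800] '
                '"GET / HTTP/1.1" 304 0 "-"')
print(extract_search_term(with_term))     # Mickey
print(extract_search_term(without_term))  # None
```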

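One more thing: some log entries have no Input parameter and a status of 304. These requests were served from the browser cache. Since search results should not be fixed, I added the line add_header Cache-Control no-store; to the conf to disable caching.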

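Build the Pattern once in the mapper, rather than initializing it on every map() call.
N-grams are built from the search terms; here I simply assume the search term is already '+'-separated (space-separated input in the search box shows up '+'-separated in the log). [TODO: implement proper tokenization later]
MapReduce implementation:
- Job1: SplitNgram --> match the search term with the regex, split it, and count n-gram occurrences;
- Job2: GetNgram --> filter out n-grams whose count is below the given threshold, and get the topK completions for each origin term.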
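The two jobs can be sketched in memory with plain Python, which is handy for checking the logic before writing the MapReduce version. This is an illustration under assumptions, not the actual jobs: split_ngrams, get_ngrams, max_n, threshold and top_k are hypothetical names, and I am assuming that the "origin" of an n-gram is everything except its last word, with that last word being the suggested completion.

```python
from collections import Counter, defaultdict
import heapq


def split_ngrams(search_terms, max_n=3):
    """Job 1 (sketch): count every word n-gram with 2 <= n <= max_n."""
    counts = Counter()
    for term in search_terms:
        words = [w for w in term.lower().split("+") if w]
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def get_ngrams(counts, threshold=2, top_k=3):
    """Job 2 (sketch): drop rare n-grams, keep the top_k followers per origin."""
    followers = defaultdict(list)
    for ngram, count in counts.items():
        if count < threshold:
            continue
        origin, _, last = ngram.rpartition(" ")
        followers[origin].append((count, last))
    return {origin: [word for _, word in heapq.nlargest(top_k, candidates)]
            for origin, candidates in followers.items()}


terms = ["how+to+cook+rice", "how+to+cook+pasta",
         "how+to+learn+python", "how+to+cook+rice"]
counts = split_ngrams(terms)
suggestions = get_ngrams(counts, threshold=2, top_k=3)
print(suggestions["how to"])
```

In the real jobs the Counter corresponds to the reducer of Job1, and the per-origin top-K selection to the reducer of Job2.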

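Simulating searches

Poor but plucky little nili has nothing at all: zero data to work with.
The simulation itself is simple: just write a program (python or java or scala or whatever).
The search terms should be at least somewhat meaningful, though. The approach I have in mind for now is to find a pile of data and have a script send requests to nginx as it reads through it.

A demo of simulating the HTTP request: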

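import http.client  # Python 3 name of the Python 2 httplib module

conn = http.client.HTTPConnection('host101')
word = 'test'
# The path needs a leading '/', and the parameter name must match the form field "Input".
conn.request('GET', '/result.html?Input=' + word)
conn.getresponse().read()  # read the reply so the request completes
conn.close()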

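As for search data, the only source I have found so far is a Quora competition on Kaggle, which provides a set of Quora questions; to some extent those count as search data too. (Quite pleased with myself here... so a later extension could be to crawl Quora.)

The full simulation demo, with the search data added, is below. [Note: connection efficiency and similar issues are not considered yet.]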

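import csv
import http.client
import time
from urllib.parse import quote_plus

def get_search_word(file_name):
    # Columns: id, qid1, qid2, question1, question2, is_duplicate
    with open(file_name, newline='', encoding='utf-8') as fi:
        # csv.reader handles quoted questions that contain commas.
        for row in csv.reader(fi):
            if len(row) < 6:
                continue
            # Send question1 and question2 as two separate searches.
            for question in (row[3], row[4]):
                conn = http.client.HTTPConnection('10.3.242.101')
                # quote_plus turns spaces into '+', as a browser form submission would.
                conn.request('GET', '/result.html?Input=' + quote_plus(question))
                conn.getresponse().read()
                conn.close()
            time.sleep(5)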

Nginx + Flume + Hadoop log analysis, Ngram + AutoComplete