Flume Notes

Custom Flume interceptor: implement the Interceptor interface
Custom Flume source: extend AbstractSource
Custom Flume sink: extend AbstractSink

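Azkaban: a job scheduling tool; ordinary day-to-day usage is all that is needed
It covers job scheduling, timed execution, and dependencies between jobs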

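Sqoop: a data import/export tool
import: load data from a relational database into the big data platform
export: write data from the big data platform back to a relational database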

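Importing MySQL data into HDFS: specify the field delimiter and the target path; -m sets how many map tasks run the import
For 100 GB of data, 10-30 map tasks is a reasonable choice; the job should finish within roughly half an hour.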

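Incremental import is controlled by three options (--check-column, --incremental, --last-value)
In practice it is usually done with a --where condition or with --query instead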

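In real projects every table usually maintains three columns: create_time, update_time, is_deleted
In real projects deletes are almost always soft deletes (the row is flagged, not removed)
update_time identifies the rows that were inserted or updated on a given day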
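The update_time approach can be sketched as a small helper that builds the argument list of a sqoop import for one day's changes. This is only an illustration: the table user_info, its columns, the JDBC URL, and the target directory are placeholders that do not come from these notes.

```python
from datetime import date, timedelta

def daily_import_args(day, mappers=1):
    """Build sqoop import arguments for rows inserted or updated on `day`."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    # Sqoop requires the literal $CONDITIONS token in a free-form --query.
    query = (
        "select id, name, create_time, update_time, is_deleted "
        "from user_info "  # hypothetical table
        f"where update_time >= '{start}' and update_time < '{end}' "
        "and $CONDITIONS"
    )
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://localhost:3306/demo",  # placeholder URL
        "--username", "root",
        "--query", query,
        "--target-dir", f"/warehouse/user_info/{day:%Y%m%d}",  # placeholder path
        "--fields-terminated-by", "\\001",
        # With more than one map task, --query also needs a --split-by column.
        "-m", str(mappers),
    ]

args = daily_import_args(date(2013, 9, 18))
```

Half-open time bounds (>= start, < next day) keep a row updated exactly at midnight from being imported twice.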

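If the data changes, one person ends up with several rows in the warehouse. What then?
What about deleted rows? Treat them as updates (a soft delete is just an update of is_deleted)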
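One common way to handle several versions of the same row, which the notes do not spell out, is to keep only the version with the latest update_time. A minimal sketch with made-up sample rows:

```python
def merge_latest(snapshot, increment):
    """Keep one row per id: the version with the greatest update_time."""
    latest = {}
    for row in list(snapshot) + list(increment):
        current = latest.get(row["id"])
        # 'YYYY-MM-DD HH:MM:SS' strings sort correctly as plain text.
        if current is None or row["update_time"] >= current["update_time"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])

# Hypothetical data: yesterday's snapshot plus today's changed rows.
snapshot = [
    {"id": 1, "name": "zhangsan", "update_time": "2013-09-17 10:00:00", "is_deleted": 0},
    {"id": 2, "name": "lisi", "update_time": "2013-09-17 11:00:00", "is_deleted": 0},
]
increment = [
    {"id": 1, "name": "zhangsan2", "update_time": "2013-09-18 09:00:00", "is_deleted": 0},
    {"id": 2, "name": "lisi", "update_time": "2013-09-18 12:00:00", "is_deleted": 1},  # soft delete
]
merged = merge_latest(snapshot, increment)
```

Because a soft delete arrives as an ordinary update, the same merge covers changed and deleted rows alike.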

DataX: another data import/export tool

Running Linux commands remotely from Java code: sshxcute.jar

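Clickstream log analysis: mainly the analysis of nginx log data
Clickstream log models:
Raw structured data table:
pageView model: focuses on each individual page view
visit model: focuses on each session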

Common website analytics metrics:
IP: the number of distinct IP addresses
PV: page views, the total number of pages viewed; every page load counts once
UV: unique visitors, the number of distinct people who visited today, identified by cookie

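Basic metrics: number of visits, time on site, time on page

Compound metrics: pages per visitor (PV / distinct visitors), bounce rate, exit rate

Source analysis: which channel the users came from
Page analysis: which pages of the site were visited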
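The counting metrics above can be sketched over a handful of log records. The records and their field names (ip, cookie, url) are invented for this illustration.

```python
# Hypothetical log records; the field names are assumptions.
records = [
    {"ip": "1.1.1.1", "cookie": "u1", "url": "/a"},
    {"ip": "1.1.1.1", "cookie": "u1", "url": "/b"},
    {"ip": "1.1.1.1", "cookie": "u2", "url": "/a"},
    {"ip": "2.2.2.2", "cookie": "u3", "url": "/a"},
]

def basic_metrics(rows):
    pv = len(rows)                          # every page load counts once
    uv = len({r["cookie"] for r in rows})   # distinct visitors, by cookie
    ip = len({r["ip"] for r in rows})       # distinct IP addresses
    return {"pv": pv, "uv": uv, "ip": ip, "pages_per_visitor": pv / uv}

metrics = basic_metrics(records)
```

Two visitors behind the same IP (u1 and u2 here) are why UV and IP can differ.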

Offline log analysis architecture:
Log collection: Flume with source: TailDirSource, channel: memory channel, sink: HdfsSink
Preprocessing: MapReduce
Loading: load the data into Hive tables
Analysis: HQL statements
Report display: ECharts

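Dimensional modeling basics: dimensional modeling is a common technique for analysis in a data warehouse

Fact table: its main job is to record accurately the events that took place; a fact is always something that has already happened
Dimension table: looks at those events from different angles, and each angle gives a different result
Yesterday I had a coffee at Starbucks and spent 200 yuan.
Time: yesterday
Place: Starbucks
Amount: 200 yuan

Seen head-on it is a ridge, from the side a peak; far or near, high or low, every view differs

Fact tables and dimension tables emphasize different things: a fact table captures the whole event, a dimension table captures one aspect of it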

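The three styles of dimensional modeling (star, snowflake, and constellation schemas):
Star schema: shaped like a star, with the fact table at the center and the dimension tables around it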

ods_weblog_origin has a column time_local in the format yyyy-MM-dd HH:mm:ss

Task: how many PVs were there between 06:00 and 07:00?

select count(1) from ods_weblog_origin where substring(time_local,12,2) = '06'

The same for 07:00 to 08:00:
select count(1) from ods_weblog_origin where substring(time_local,12,2) = '07'

Task: PVs for every hour (this assumes an hour column exists)
select hour,count(1) from ods_weblog_origin group by hour

So the time field has to be split apart
2013-09-18 06:49:18 ==> year: 2013 month: 09 day: 18 hour: 06

select substring(time_local,12,2) as hour,count(1) from ods_weblog_origin group by substring(time_local,12,2);
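The substring trick can be checked outside Hive with a short sketch. The first two timestamps appear in the sample output further down; the third is made up so that a second hour shows up.

```python
from collections import Counter

def hour_of(time_local):
    # Hive's substring(time_local, 12, 2) is 1-based; the Python slice is 0-based.
    return time_local[11:13]

def pv_per_hour(times):
    """Count page views per hour, like group by substring(time_local,12,2)."""
    return Counter(hour_of(t) for t in times)

times = [
    "2013-09-18 06:49:18",
    "2013-09-18 06:49:33",
    "2013-09-18 07:01:02",  # invented row
]
hourly = pv_per_hour(times)
```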

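A data warehouse tolerates redundant data.
The time field is split with plain substring calls: month, year, day, hour
http_refer is split with parse_url_tuple: host, path, query, queryId

http_refer: the page the visitor came from (the referring URL)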

"http://cos.name/category/software/packages/?username=zhangsan" ==> jd.com

"http://baidu.com/category/software/packages/?username=zhangsan" ==> jd.com

"http://google.com/category/software/packages/?username=zhangsan" ==> jd.com

"http://360.com/category/software/packages/?username=zhangsan" ==> jd.com

Traffic coming from each referring site:

select hosts,count(1) from ods_weblog_origin group by hosts
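What parse_url_tuple extracts (host, path, query, and one query parameter) can be imitated with Python's standard urllib.parse. This is a sketch of the idea, not Hive's implementation; the helper name split_referer and its key parameter are made up here.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def split_referer(http_referer, key="id"):
    """Return (host, path, query, value of `key`); all None for an empty referer."""
    url = http_referer.strip('"')
    if url in ("", "-"):
        return (None, None, None, None)
    parts = urlsplit(url)
    value = parse_qs(parts.query).get(key, [None])[0]
    return (parts.hostname, parts.path, parts.query or None, value)

# Referer values in the same quoted form as the log table.
referers = [
    '"http://cos.name/category/software/packages/"',
    '"http://www.angularjs.cn/A00n"',
    '"-"',
]
# Equivalent of: select host, count(1) ... group by host
hosts = Counter(split_referer(r)[0] for r in referers)
```

Requests without a referer ("-") end up under None, which matches the NULL host rows in the Hive output.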

+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+
| t_ods_tmp_referurl.valid | t_ods_tmp_referurl.remote_addr | t_ods_tmp_referurl.remote_user | t_ods_tmp_referurl.time_local | t_ods_tmp_referurl.request | t_ods_tmp_referurl.status | t_ods_tmp_referurl.body_bytes_sent | t_ods_tmp_referurl.http_referer | t_ods_tmp_referurl.http_user_agent | t_ods_tmp_referurl.datestr | t_ods_tmp_referurl.host | t_ods_tmp_referurl.path | t_ods_tmp_referurl.query | t_ods_tmp_referurl.query_id |
+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+
| false | 194.237.142.21 | - | 2013-09-18 06:49:18 | /wp-content/uploads/2013/07/rstudio-git3.png | 304 | 0 | "-" | "Mozilla/4.0(compatible;)" | 20130918 | NULL | NULL | NULL | NULL |
| false | 163.177.71.12 | - | 2013-09-18 06:49:33 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 163.177.71.12 | - | 2013-09-18 06:49:36 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 101.226.68.137 | - | 2013-09-18 06:49:42 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 101.226.68.137 | - | 2013-09-18 06:49:45 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 60.208.6.156 | - | 2013-09-18 06:49:48 | /wp-content/uploads/2013/07/rcassandra.png | 200 | 185524 | "http://cos.name/category/software/packages/" | "Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36" | 20130918 | cos.name | /category/software/packages/ | NULL | NULL |
| false | 222.68.172.190 | - | 2013-09-18 06:49:57 | /images/my.jpg | 200 | 19939 | "http://www.angularjs.cn/A00n" | "Mozilla/5.0(WindowsNT6.1)AppleWebKit/537.36(KHTML,likeGecko)Chrome/29.0.1547.66Safari/537.36" | 20130918 | www.angularjs.cn | /A00n | NULL | NULL |
| false | 183.195.232.138 | - | 2013-09-18 06:50:16 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 183.195.232.138 | - | 2013-09-18 06:50:16 | / | 200 | 20 | "-" | "DNSPod-Monitor/1.0" | 20130918 | NULL | NULL | NULL | NULL |
| false | 66.249.66.84 | - | 2013-09-18 06:50:28 | /page/6/ | 200 | 27777 | "-" | "Mozilla/5.0(compatible;Googlebot/2.1;+http://www.google.com/bot.html)" | 20130918 | NULL | NULL | NULL | NULL |
+---------------------------+---------------------------------+---------------------------------+--------------------------------+-----------------------------------------------+----------------------------+-------------------------------------+------------------------------------------------+----------------------------------------------------+-----------------------------+--------------------------+-------------------------------+---------------------------+------------------------------+--+

PV by hour

PV by referrer
1. PV per hour for each referring URL
Words such as "every" and "each" in a requirement signal a group by

select month,day,hour,ref_host,count(1) from ods_weblog_detail group by month,day,hour,ref_host limit 10;

2. PV per hour for each referring host, sorted

select ref_host,month,day,hour,count(1) as total_count from ods_weblog_detail group by ref_host,month,day,hour
order by total_count desc ;

05 google.com google.com baidu.com 360.com 360.com
06 google.com 360.com baidu.com baidu.com

05 google.com 2
05 360.com 2
05 baidu.com 1

06 baidu.com 2
06 google.com 1
06 360.com 1

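-- Requirement: by time, find the top N sources (here top 2) with the most PVs for each hour of a day

A first attempt at the source with the most PVs per hour; it does not run, because aggregates cannot be nested (max(count(1))):
select month,day,hour,ref_host,max(count(1)) from ods_weblog_detail group by month,day,hour,ref_host

Top 2 means two rows from every group: with 10 groups that is 20 rows. A plain limit 2 does not do this, it cuts the whole result down to two rows:
select month,day,hour,ref_host,count(1) from ods_weblog_detail group by month,day,hour,ref_host limit 2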
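What "top 2 per hour" should return can be seen on the small sample listed above (hours 05 and 06). The sketch below is plain Python rather than Hive, and breaking ties by host name is a choice made here for determinism.

```python
from collections import Counter

# Sample referring hosts per hour, as listed in the notes.
visits = {
    "05": ["google.com", "google.com", "baidu.com", "360.com", "360.com"],
    "06": ["google.com", "360.com", "baidu.com", "baidu.com"],
}

def top_n_per_hour(visits, n):
    result = {}
    for hour, hosts in visits.items():
        counts = Counter(hosts)
        # Sort by count descending, then by host name to break ties.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[hour] = ranked[:n]  # n rows from every group
    return result

top2 = top_n_per_hour(visits, 2)
```

The limit is applied inside each group, which is exactly what a global limit 2 cannot express.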

Group top-N in Hive
https://www.cnblogs.com/wujin/p/6051768.html
id name sal
1 a 10
2 a 12
3 b 13
4 b 12
5 a 14
6 a 15
7 a 13
8 b 11
9 a 16
10 b 17
11 a 14

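For each user, find the three largest tip amounts

Group top-N uses row_number() over,
dense_rank() over,
and rank() over

id name sal row_number dense_rank rank
9 a 16 1 1 1
6 a 15 2 2 2
11 a 14 3 3 3
5 a 14 4 3 3
7 a 13 5 4 5
2 a 12 6 5 6
1 a 10 7 6 7

id name sal row_number
10 b 17 1
3 b 13 2
4 b 12 3
8 b 11 4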

select id,
name,
sal,
rank()over(partition by name order by sal desc ) rp,
dense_rank() over(partition by name order by sal desc ) drp,
row_number()over(partition by name order by sal desc) rmp
from f_test;

id name sal rp drp rmp
10 b 17 1 1 1
3 b 13 2 2 2
4 b 12 3 3 3
8 b 11 4 4 4

9 a 16 1 1 1
6 a 15 2 2 2
11 a 14 3 3 3
5 a 14 3 3 4
7 a 13 5 4 5
2 a 12 6 5 6
1 a 10 7 6 7
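The difference between the three ranking functions is easiest to see by computing them by hand. The sketch below reproduces the semantics in plain Python on the same sample rows; it illustrates the definitions and is not how Hive evaluates window functions. The order of rows with equal sal (ids 5 and 11) is not guaranteed in Hive; here it simply follows input order.

```python
from itertools import groupby

rows = [(1, "a", 10), (2, "a", 12), (3, "b", 13), (4, "b", 12), (5, "a", 14),
        (6, "a", 15), (7, "a", 13), (8, "b", 11), (9, "a", 16), (10, "b", 17),
        (11, "a", 14)]

def ranked(rows):
    """Return (id, name, sal, rank, dense_rank, row_number), sal descending per name."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    for name, group in groupby(ordered, key=lambda r: r[1]):
        rank = dense = 0
        prev = None
        for row_number, (id_, _, sal) in enumerate(group, start=1):
            if sal != prev:
                rank = row_number  # rank skips ahead after ties
                dense += 1         # dense_rank never leaves gaps
                prev = sal
            out.append((id_, name, sal, rank, dense, row_number))
    return out

def top_n(rows, n):
    """Top n rows per name, filtering on row_number."""
    return [r for r in ranked(rows) if r[5] <= n]

top3 = top_n(rows, 3)
```

Filtering on row_number returns exactly n rows per group; filtering on rank or dense_rank instead would keep all tied rows.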

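Hive functions worth remembering: rows to columns, columns to rows, group top-N, explode, reflect

-- Requirement: the average number of pages requested per visitor today.
-- total page requests / distinct visitors

Page analysis: which pages of the site were visited
1. PV of each page

request is the requested URL; every URL corresponds to one page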
select request,count(1) from ods_weblog_detail group by request

2. Top 10 visited pages in the 20130918 partition
select request,count(1) as total_count from ods_weblog_detail where datestr = '20130918' group by request having request is not null order by total_count desc limit 10;

3. Top 10 most popular pages for each day

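Visitor analysis: analysis of the users themselves

Counts are based on sessions
New vs. old visitors: whether they have been to the site before
Returning visitors: visited several times
One-time visitors: visited only once

1. Requirement: count unique visitors and the PVs they produce, by time
Unique visitor: each distinct user. Visitors are properly told apart by cookie; the query below uses remote_addr as a stand-in

PVs per unique visitor per hour
select month,day,hour,remote_addr,count(1) from ods_weblog_detail group by month,day,hour,remote_addr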

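Visit analysis:
-- Returning / one-time visitors
Find the returning visitors, those who visited several times within one day
Rows in the visit table:
sessionId remote_addr
1 192.168.52.100
2 192.168.52.100

Query on the visit table:

select remote_addr,count(1) as total_count from visit group by remote_addr having total_count > 1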

Today's returning visitors and their visit counts:

select remote_addr,count(1) as total_count from visit where datestr = '20130918' group by remote_addr having total_count > 1

-- Average visit frequency
How many times did one person visit on average?
total visits / distinct visitors

select count(1)/count(distinct remote_addr) from visit

-- Average page views per visitor
How many pages did one person view on average?

select sum(pageVisits)/count(distinct remote_addr) from visit
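The three visit queries can be traced on a tiny in-memory version of the visit table. The rows are made up; only the column names sessionId, remote_addr, and pageVisits come from the notes.

```python
from collections import Counter

# Hypothetical visit rows: (sessionId, remote_addr, pageVisits).
visits = [
    ("s1", "192.168.52.100", 3),
    ("s2", "192.168.52.100", 1),
    ("s3", "192.168.52.101", 2),
]

def returning_visitors(visits):
    """Addresses with more than one session, like having total_count > 1."""
    counts = Counter(addr for _, addr, _ in visits)
    return {addr: n for addr, n in counts.items() if n > 1}

def visits_per_visitor(visits):
    # count(1) / count(distinct remote_addr)
    return len(visits) / len({addr for _, addr, _ in visits})

def pages_per_visitor(visits):
    # sum(pageVisits) / count(distinct remote_addr)
    return sum(pages for _, _, pages in visits) / len({addr for _, addr, _ in visits})
```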

Requirement 1: total tips per user per month
select username,month,sum(salary) from t_salary_detail group by username,month;

+-----------+----------+------+--+
| username | month | _c2 |
+-----------+----------+------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| A | 2015-03 | 16 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 17 |
+-----------+----------+------+--+

Requirement 2: cumulative tips per user, month by month

+-----------+----------+---------+-------------+
| username | month | salary | cumulative |
+-----------+----------+---------+-------------+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| A | 2015-03 | 16 | 59 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
| B | 2015-03 | 17 | 62 |
+-----------+----------+---------+-------------+

select * from (
select username,month,sum(salary) as salary from t_salary_detail group by username,month
) tempTable1 inner join (select username,month,sum(salary) as salary from t_salary_detail group by username,month)
tempTable2 on tempTable1.username = tempTable2.username where tempTable2.month <= tempTable1.month;

+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+
| temptable1.username | temptable1.month | temptable1.salary | temptable2.username | temptable2.month | temptable2.salary |
+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |

| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |

| A | 2015-03 | 16 | A | 2015-01 | 33 |
| A | 2015-03 | 16 | A | 2015-02 | 10 |
| A | 2015-03 | 16 | A | 2015-03 | 16 |

| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-03 | 17 | B | 2015-01 | 30 |
| B | 2015-03 | 17 | B | 2015-02 | 15 |
| B | 2015-03 | 17 | B | 2015-03 | 17 |
+----------------------+-------------------+--------------------+----------------------+-------------------+--------------------+--+

select tempTable1.username,tempTable1.month,sum(tempTable2.salary) from (
select username,month,sum(salary) as salary from t_salary_detail group by username,month
) tempTable1 inner join (select username,month,sum(salary) as salary from t_salary_detail group by username,month)
tempTable2 on tempTable1.username = tempTable2.username where tempTable2.month <= tempTable1.month group by tempTable1.username,tempTable1.month;

+----------------------+-------------------+------+--+
| temptable1.username | temptable1.month | _c2 |
+----------------------+-------------------+------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 43 |
| A | 2015-03 | 59 |
| B | 2015-01 | 30 |
| B | 2015-02 | 45 |
| B | 2015-03 | 62 |
+----------------------+-------------------+------+--+

This self-join pattern is the cascading (cumulative) sum in Hive
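The expected result of the cascading sum can be double-checked with a running total per user. This sketch uses Python's itertools.accumulate on the monthly totals from requirement 1; it verifies the numbers and does not replace the Hive query.

```python
from itertools import accumulate, groupby

# Monthly totals per user, taken from the result of requirement 1.
monthly = [("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 16),
           ("B", "2015-01", 30), ("B", "2015-02", 15), ("B", "2015-03", 17)]

def running_totals(rows):
    """Return (username, month, cumulative salary) ordered by user and month."""
    out = []
    ordered = sorted(rows)  # by username, then month
    for user, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        totals = accumulate(salary for _, _, salary in group)
        out.extend((user, month, total)
                   for (_, month, _), total in zip(group, totals))
    return out

cumulative = running_totals(monthly)
```

The self-join reaches the same totals by pairing every month with all earlier months of the same user and summing them.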

dw_oute_numbs
+---------------------+----------------------+--+
| dw_oute_numbs.step | dw_oute_numbs.numbs |
+---------------------+----------------------+--+
| step1 | 1029 |
| step2 | 1029 |
| step3 | 1028 |
| step4 | 1018 |
+---------------------+----------------------+--+

Conversion rate of every step relative to the first step
select a.numbs/b.numbs from dw_oute_numbs a inner join dw_oute_numbs b where b.step = 'step1';

+---------+----------+---------+----------+--+
| a.step | a.numbs | b.step | b.numbs |
+---------+----------+---------+----------+--+
| step1 | 1029 | step1 | 1029 |
| step2 | 1029 | step1 | 1029 |
| step3 | 1028 | step1 | 1029 |
| step4 | 1018 | step1 | 1029 |
+---------+----------+---------+----------+--+

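Conversion rate of every step relative to the previous step
Start from the full self-join (the rows with b.step = step1 are the ones already shown above):
select * from dw_oute_numbs a inner join dw_oute_numbs b ;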

| step1 | 1029 | step2 | 1029 |
| step2 | 1029 | step2 | 1029 |
| step3 | 1028 | step2 | 1029 |
| step4 | 1018 | step2 | 1029 |

| step1 | 1029 | step3 | 1028 |
| step2 | 1029 | step3 | 1028 |
| step3 | 1028 | step3 | 1028 |
| step4 | 1018 | step3 | 1028 |

| step1 | 1029 | step4 | 1018 |
| step2 | 1029 | step4 | 1018 |
| step3 | 1028 | step4 | 1018 |
| step4 | 1018 | step4 | 1018 |
+---------+----------+---------+----------+--+

Keep only the pairs where b is the step right after a, i.e. step2 - 1 = step1

select b.step, b.numbs/a.numbs from dw_oute_numbs a inner join dw_oute_numbs b on cast(substring(b.step,5,1) as int) - 1 = cast(substring(a.step,5,1) as int)
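Both conversion rates can be computed directly from the four rows of dw_oute_numbs, which is a handy cross-check for the join results. A minimal sketch:

```python
# Rows of dw_oute_numbs as shown in the notes.
steps = [("step1", 1029), ("step2", 1029), ("step3", 1028), ("step4", 1018)]

def conversion_rates(steps):
    """Return (step, rate vs. first step, rate vs. previous step)."""
    first = steps[0][1]
    out = []
    for i, (name, numbs) in enumerate(steps):
        # The first step has no predecessor; compare it with itself.
        previous = steps[i - 1][1] if i > 0 else numbs
        out.append((name, numbs / first, numbs / previous))
    return out

rates = conversion_rates(steps)
```

The Hive join on step number minus 1 returns no row for step1, whereas this sketch reports step1 against itself as 1.0.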

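Hive functions used so far:
parse_url_tuple
substring
concat
cast
sum
count
Pay attention to the date/time functions as well

Grouping functions:
Cascading sum:
For more functions, see the Hive documentation: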

In real projects, keep in mind that dw-layer tables are almost always stored as ORC or Parquet

Data export:
/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyday --table dw_pvs_everyday --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_everyhour_oneday/datestr=20130918 --table dw_pvs_everyhour_oneday --input-fields-terminated-by '\001'

/export/servers/sqoop-1.4.6-cdh5.14.0/bin/sqoop export --connect jdbc:mysql://192.168.29.22:3306/weblog --username root --password 123456 --m 1 --export-dir /user/hive/warehouse/weblog.db/dw_pvs_referer_everyhour/datestr=20130918 --table dw_pvs_referer_everyhour --input-fields-terminated-by '\001'

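Workflow scheduling:
Flume collection: runs continuously, no scheduling needed
The three MR programs (data cleaning, pageView model, visit model) need scheduled runs
ods-layer tables: partitioned tables loaded one partition per day; no drop table is needed, each run loads data into the matching partition, so the load must be scheduled
dw-layer result tables: dropped or truncated every day, so this must be scheduled
Result export: must be scheduled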
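The scheduled jobs above form a dependency chain, which is what a scheduler such as Azkaban resolves. The sketch below only illustrates the ordering with Python's graphlib (Python 3.9+); the job names and the exact dependencies are assumptions, and this is not Azkaban's job format.

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs it waits for. Names and dependencies are
# assumptions made for this illustration.
dependencies = {
    "mr_clean": set(),
    "mr_pageview": {"mr_clean"},
    "mr_visit": {"mr_pageview"},
    "load_ods_partitions": {"mr_visit"},
    "dw_statistics": {"load_ods_partitions"},
    "sqoop_export": {"dw_statistics"},
}

def run_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
```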

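Course summary:

Building the visit model involves the three MR programs above
Create tables in Hive and load the data: weblog, pageView, visit
Split the weblog table: break up the time field and http_referer

Analysis by module
Page analysis,
traffic analysis and so on, using the group top-N functions and the cascading sum (a table joined to itself)

Result export ==> sqoop export
Scheduled jobs ==> Azkaban
Report display: integration of the three major frameworks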
