Flink real-time project - day04 - 1. Case study: counting the distinct users and the number of click/participation events for an activity  2. Multi-dimensional activity metrics (custom Redis sink)
1. Case study
Sample data (fields: user ID, activity ID, time, event type, province):
u001,A1,2019-09-02 10:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,2,北京市
u002,A1,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,2,北京市

Event types:
0: impression (exposure)
1: click
2: participation

Requirement: for each activity, count the number of distinct users and the total number of events for the click and participation event types.
- Approach 1: implement it with a ValueState that holds a HashSet of user IDs. The full code is shown below.
ActivityCountAdv1
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class ActivityCountAdv1 {
public static void main(String[] args) throws Exception {
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStreamSource<String> lines = env.socketTextStream("feng05", 8888);
// split and clean up each input line into fields
SingleOutputStreamOperator<Tuple5<String, String, String, Integer, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, Integer, String>>() {
@Override
public Tuple5<String, String, String, Integer, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String activityID = fields[1];
String date = fields[2];
Integer type = Integer.parseInt(fields[3]);
String province = fields[4];
return Tuple5.of(uid, activityID, date, type, province);
}
});
// key the stream by activity ID and event type
KeyedStream<Tuple5<String, String, String, Integer, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, Integer, String>, Tuple4<String, Integer, Integer, Integer>>() {

// state holding the HashSet of deduplicated user IDs
private transient ValueState<HashSet<String>> uidState;
// state holding the (non-deduplicated) event count
private transient ValueState<Integer> countState;

@Override
public void open(Configuration parameters) throws Exception {
// state descriptor for the deduplicated user-ID set
ValueStateDescriptor<HashSet<String>> stateDescriptor1 = new ValueStateDescriptor<HashSet<String>>(
"uid-state",
TypeInformation.of(new TypeHint<HashSet<String>>(){})
);
// state descriptor for the event counter
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);
// obtain the state handles from the runtime context
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
}

@Override
public void processElement(Tuple5<String, String, String, Integer, String> value, Context ctx, Collector<Tuple4<String, Integer, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
Integer type = value.f3;
// deduplicate user IDs with the HashSet and update uidState
HashSet<String> hashSet = uidState.value();
if(hashSet == null){
hashSet = new HashSet<>();
}
hashSet.add(uid);
uidState.update(hashSet);
// count the total number of events (not deduplicated)
Integer count = countState.value();
if(count == null) {
count = 0;
}
count += 1;
countState.update(count);
out.collect(Tuple4.of(aid,type,hashSet.size(), count));
}
}).print();
env.execute();
}
}
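As a quick sanity check, feeding the sample records from the top of this post into the socket in order should produce output along the following lines (ignoring the subtask prefix that print() adds; each tuple is activity ID, event type, distinct users, total events):

(A1,1,1,1)
(A1,1,1,2)
(A1,2,1,1)
(A1,1,2,3)
(A2,1,1,1)
(A2,1,1,2)
(A2,2,1,1)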
If a HashSet is used for deduplication and the number of users is large, the state consumes a large amount of memory, performance degrades, and the job may even run out of memory.
- Approach 2 (improved): store the user IDs in a BloomFilter. A BloomFilter can tell you that an element definitely has not been seen before while using very little memory; the trade-off is a small false-positive rate, so the distinct-user count can be slightly low. Because a BloomFilter has no counter, an additional piece of state is needed to store the number of distinct users. The core dedup pattern is sketched right below, followed by the full job (ActivityCountAdv2).
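A minimal, self-contained sketch of that dedup pattern (the class name BloomFilterDedupSketch is made up for illustration; the shaded Guava path matches the imports used in the listing below and differs between Flink versions):

import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;

public class BloomFilterDedupSketch {
    public static void main(String[] args) {
        // filter sized for the expected number of distinct user IDs
        BloomFilter<CharSequence> bloomFilter =
                BloomFilter.create(Funnels.unencodedCharsFunnel(), 10_000_000);
        int uidCount = 0;
        for (String uid : new String[]{"u001", "u001", "u002"}) {
            // mightContain == false means the uid was definitely never added before,
            // so it is safe to count it as a new distinct user and record it
            if (!bloomFilter.mightContain(uid)) {
                bloomFilter.put(uid);
                uidCount += 1;
            }
        }
        System.out.println(uidCount); // prints 2
    }
}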
ActivityCountAdv2
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ActivityCountAdv2 {

public static void main(String[] args) throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// input line format: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

// split and clean up each input line into fields
SingleOutputStreamOperator<Tuple5<String, String, String, String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, String, String>>() {
@Override
public Tuple5<String, String, String, String, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String time = fields[2];
String type = fields[3];
String province = fields[4];
return Tuple5.of(uid, aid, time, type, province);
}
});

// key the stream by activity ID and event type
KeyedStream<Tuple5<String, String, String, String, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, String, String>, Tuple4<String, String, Integer, Integer>>() {

// state holding the BloomFilter used to deduplicate user IDs
private transient ValueState<BloomFilter> uidState;
// state holding the number of distinct users
private transient ValueState<Integer> uidCountState;
// state holding the (non-deduplicated) event count
private transient ValueState<Integer> countState;

@Override
public void open(Configuration parameters) throws Exception {
// state descriptor for the BloomFilter
ValueStateDescriptor<BloomFilter> stateDescriptor1 = new ValueStateDescriptor<BloomFilter>(
"uid-state",
TypeInformation.of(new TypeHint<BloomFilter>(){})
);

// state descriptor for the event counter
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);

// state descriptor for the distinct-user counter
ValueStateDescriptor<Integer> stateDescriptor3 = new ValueStateDescriptor<Integer>(
"uid-count-state",
Integer.class
);
// obtain the state handles from the runtime context
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
uidCountState = getRuntimeContext().getState(stateDescriptor3);
}

@Override
public void processElement(Tuple5<String, String, String, String, String> value, Context ctx, Collector<Tuple4<String, String, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
String type = value.f3;
// deduplicate user IDs with the BloomFilter
BloomFilter bloomFilter = uidState.value();
Integer uidCount = uidCountState.value(); // number of distinct users
Integer count = countState.value(); // total number of events
if(count == null) {
count = 0;
}
if(bloomFilter == null) {
bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
uidCount = 0;
}
if(!bloomFilter.mightContain(uid)) {
bloomFilter.put(uid); // record the uid in the BloomFilter
uidCount += 1;
}
count += 1;
countState.update(count);
uidState.update(bloomFilter);
uidCountState.update(uidCount);
out.collect(Tuple4.of(aid, type, uidCount, count));
}
}).print();

env.execute();
}
}
2. Multi-dimensional activity metrics
Here the same stream has to be keyed several times (one keyBy per dimension combination), which is fairly tedious. The aggregated results are written to Redis, so no hand-written Flink state is needed; see the code below.
ActivityCountWithMultiDimension
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ActivityCountWithMultiDimension {

public static void main(String[] args) throws Exception {

// the first program argument is the path to a properties file holding the Redis connection settings
ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

// input line format: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

SingleOutputStreamOperator<ActivityBean> beanStream = lines.map(new MapFunction<String, ActivityBean>() {
@Override
public ActivityBean map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String date = fields[2].split(" ")[0];
String type = fields[3];
String province = fields[4];
return ActivityBean.of(uid, aid, date, type, province);
}
});

// aggregate the same stream along three dimension combinations
SingleOutputStreamOperator<ActivityBean> res1 = beanStream.keyBy("aid", "type").sum("count");
SingleOutputStreamOperator<ActivityBean> res2 = beanStream.keyBy("aid", "type", "date").sum("count");
SingleOutputStreamOperator<ActivityBean> res3 = beanStream.keyBy("aid", "type", "date", "province").sum("count");

// activity + type
res1.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.ACTIVITY_COUNT +"-"+ value.aid, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

// activity + type + date
res2.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

// activity + type + date + province
res3.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.PROVINCE_DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date + "-" + value.province, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

env.execute();
}
}
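ActivityCountWithMultiDimension expects the path of a properties file as its first program argument (args[0]) and registers it as the global job parameters so that MyRedisSink can read the Redis connection settings. A hypothetical conf.properties (all values are placeholders; redis.host and redis.password are required, redis.port and redis.db default to 6379 and 0):

redis.host=localhost
redis.password=123456
redis.port=6379
redis.db=0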
Constant
package cn._51doit.flink.day08;

public class Constant {

// Redis key prefixes for the three dimension combinations
public static final String ACTIVITY_COUNT = "ACTIVITY_COUNT";
public static final String DAILY_ACTIVITY_COUNT = "DAILY_ACTIVITY_COUNT";
public static final String PROVINCE_DAILY_ACTIVITY_COUNT = "PROVINCE_DAILY_ACTIVITY_COUNT";
}
ActivityBean
package cn._51doit.flink.day08;

// POJO used for keyBy on field names; it needs public fields and a no-arg constructor
public class ActivityBean {

public String uid;
public String aid;
public String date;
public String type;
public String province;
// each record represents one event; sum("count") aggregates this field
public Long count = 1L;

public ActivityBean() {}

public ActivityBean(String uid, String aid, String date, String type, String province) {
this.uid = uid;
this.aid = aid;
this.date = date;
this.type = type;
this.province = province;
}

public static ActivityBean of(String uid, String aid, String date, String type, String province) {
return new ActivityBean(uid, aid, date, type, province);
}

@Override
public String toString() {
return "ActivityBean{" +
"uid='" + uid + '\'' +
", aid='" + aid + '\'' +
", date='" + date + '\'' +
", type='" + type + '\'' +
", province='" + province + '\'' +
'}';
}
}
MyRedisSink
package cn._51doit.flink.day08;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class MyRedisSink extends RichSinkFunction<Tuple3<String, String, String>> {

private transient Jedis jedis;

@Override
public void open(Configuration parameters) throws Exception {
ParameterTool params = (ParameterTool) getRuntimeContext()
.getExecutionConfig()
.getGlobalJobParameters();
String host = params.getRequired("redis.host");
String password = params.getRequired("redis.password");
int port = params.getInt("redis.port", 6379);
int db = params.getInt("redis.db", 0);
Jedis jedis = new Jedis(host, port);
jedis.auth(password);
jedis.select(db);
this.jedis = jedis;
}

@Override
public void invoke(Tuple3<String, String, String> value, Context context) throws Exception {
if (!jedis.isConnected()) {
jedis.connect();
}
jedis.hset(value.f0, value.f1, value.f2);
}

@Override
public void close() throws Exception {
jedis.close();
}
}
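Because MyRedisSink writes every record with hset(key, field, value), each dimension combination ends up as a Redis hash whose key encodes the dimensions, whose fields are the event types, and whose values are the counts. A small readback sketch (the class name, connection settings, and the concrete key are assumptions for illustration; the key follows the DAILY_ACTIVITY_COUNT pattern built above):

import java.util.Map;

import redis.clients.jedis.Jedis;

public class RedisResultReader {
    public static void main(String[] args) {
        // use the same host/port/db that the sink was configured with
        Jedis jedis = new Jedis("localhost", 6379);
        // one hash per activity and day; fields are event types, values are counts
        Map<String, String> counts = jedis.hgetAll("DAILY_ACTIVITY_COUNT-A1-2019-09-02");
        counts.forEach((type, count) -> System.out.println(type + " -> " + count));
        jedis.close();
    }
}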