Flink real-time project - Day 04 - 1. Case study: counting the number of users and the number of times an activity is clicked or participated in 2. Multi-dimensional activity metrics (custom Redis sink)
1. Case study
Sample input (fields: user ID, activity ID, time, event type, province):
u001,A1,2019-09-02 10:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,2,北京市
u002,A1,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,2,北京市
Event type:
0: impression
1: click
2: participation
Requirement: for each activity, count the number of distinct users and the number of times it was clicked or participated in.
- Option 1: implement with ValueState holding a HashSet
The full code is shown below.
ActivityCountAdv1
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class ActivityCountAdv1 {
public static void main(String[] args) throws Exception {
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStreamSource<String> lines = env.socketTextStream("feng05", 8888);
// split each line and convert it into a tuple
SingleOutputStreamOperator<Tuple5<String, String, String, Integer, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, Integer, String>>() {
@Override
public Tuple5<String, String, String, Integer, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String activityID = fields[1];
String date = fields[2];
Integer type = Integer.parseInt(fields[3]);
String province = fields[4];
return Tuple5.of(uid, activityID, date, type, province);
}
});
// key the stream by activity ID and event type
KeyedStream<Tuple5<String, String, String, Integer, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);
keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, Integer, String>, Tuple4<String, Integer, Integer, Integer>>() {
// HashSet of deduplicated user IDs for the current key
private transient ValueState<HashSet<String>> uidState;
// Integer holding the (non-deduplicated) event count
private transient ValueState<Integer> countState;
@Override
public void open(Configuration parameters) throws Exception {
// state descriptor for the deduplicated user-ID set
ValueStateDescriptor<HashSet<String>> stateDescriptor1 = new ValueStateDescriptor<HashSet<String>>(
"uid-state",
TypeInformation.of(new TypeHint<HashSet<String>>(){})
);
// state descriptor for the event count
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);
// obtain the state handles from the runtime context
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
}

@Override
public void processElement(Tuple5<String, String, String, Integer, String> value, Context ctx, Collector<Tuple4<String, Integer, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
Integer type = value.f3;
// deduplicate with the HashSet and update uidState
HashSet<String> hashSet = uidState.value();
if(hashSet == null){
hashSet = new HashSet<>();
}
hashSet.add(uid);
uidState.update(hashSet);
// update the (non-deduplicated) event count
Integer count = countState.value();
if(count == null) {
count = 0;
}
count += 1;
countState.update(count);
out.collect(Tuple4.of(aid,type,hashSet.size(), count));
}
}).print();
env.execute();
}
}
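For reference, feeding the seven sample lines above into the socket (e.g. nc -lk 8888 on feng05) in order should print tuples of the form (activity ID, event type, distinct users, event count), roughly as follows; print() may also prefix each line with a subtask index depending on parallelism:
(A1,1,1,1)
(A1,1,1,2)
(A1,2,1,1)
(A1,1,2,3)
(A2,1,1,1)
(A2,1,1,2)
(A2,2,1,1)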
If a HashSet is used for deduplication and the number of users is large, it consumes a great deal of memory, degrades performance, and can even cause an OutOfMemoryError.
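As a rough order-of-magnitude comparison (assumed numbers): 10 million distinct user IDs of about 10 characters each kept in a HashSet<String> cost on the order of 100 bytes per entry once string objects and table entries are counted, i.e. roughly 1 GB for a single key, whereas a BloomFilter sized for 10 million insertions at a 1% false-positive probability needs about 9.6 bits per element, i.e. roughly 12 MB.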
- Option 2: an improvement — store the user IDs in a BloomFilter. A BloomFilter can tell you that an element is definitely not present while using very little memory. However, it has no counter, so an extra piece of state must be kept to store the number of deduplicated users.
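Before reading the full job, here is a minimal standalone sketch of the Guava BloomFilter calls it relies on; the class name and sizing numbers are illustrative, and plain Guava imports are used here, whereas the job below uses the copy shaded into Flink (org.apache.flink.shaded.guava18):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterSketch {
    public static void main(String[] args) {
        // size the filter for up to 10,000,000 IDs with a ~1% false-positive probability
        BloomFilter<CharSequence> filter =
                BloomFilter.create(Funnels.unencodedCharsFunnel(), 10_000_000, 0.01);

        String uid = "u001";
        if (!filter.mightContain(uid)) { // false means the ID is definitely new
            filter.put(uid);             // remember it
            // ... this is the branch in which the job increments uidCount
        }
        // mightContain() can return false positives, so a genuinely new user is
        // occasionally skipped and the deduplicated count may slightly undercount
    }
}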
ActivityCountAdv2
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ActivityCountAdv2 {

public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// sample input: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
// split each line and convert it into a tuple
SingleOutputStreamOperator<Tuple5<String, String, String, String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, String, String>>() {
@Override
public Tuple5<String, String, String, String, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String time = fields[2];
String type = fields[3];
String province = fields[4];
return Tuple5.of(uid, aid, time, type, province);
}
});
// key the stream by activity ID and event type
KeyedStream<Tuple5<String, String, String, String, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, String, String>, Tuple4<String, String, Integer, Integer>>() {
// BloomFilter used to deduplicate user IDs for the current key
private transient ValueState<BloomFilter> uidState;
// Integer holding the number of deduplicated users
private transient ValueState<Integer> uidCountState;
// Integer holding the (non-deduplicated) event count
private transient ValueState<Integer> countState;

@Override
public void open(Configuration parameters) throws Exception {
// state descriptor for the BloomFilter of user IDs
ValueStateDescriptor<BloomFilter> stateDescriptor1 = new ValueStateDescriptor<BloomFilter>(
"uid-state",
TypeInformation.of(new TypeHint<BloomFilter>(){})
);
// state descriptor for the event count
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);
// state descriptor for the deduplicated user count
ValueStateDescriptor<Integer> stateDescriptor3 = new ValueStateDescriptor<Integer>(
"uid-count-state",
Integer.class
);
// obtain the state handles
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
uidCountState = getRuntimeContext().getState(stateDescriptor3);
}

@Override
public void processElement(Tuple5<String, String, String, String, String> value, Context ctx, Collector<Tuple4<String, String, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
String type = value.f3;
// deduplicate with the BloomFilter
BloomFilter bloomFilter = uidState.value();
Integer uidCount = uidCountState.value(); // number of distinct users
Integer count = countState.value(); // number of events
if(count == null) {
count = 0;
}
if(bloomFilter == null) {
bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
uidCount = 0;
}
if(!bloomFilter.mightContain(uid)) {
bloomFilter.put(uid); // add the user ID to the BloomFilter
uidCount += 1;
}
count += 1;
countState.update(count);
uidState.update(bloomFilter);
uidCountState.update(uidCount);
out.collect(Tuple4.of(aid, type, uidCount, count));
}
}).print();

env.execute();
}
}
2. Multi-dimensional activity metrics
Here several keyBy operations are needed (one per dimension combination), which is rather tedious. Since the aggregated results are written to Redis, we do not need to manage Flink state by hand as before; see the code below.
ActivityCountWithMultiDimension
package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class ActivityCountWithMultiDimension {

public static void main(String[] args) throws Exception {
// load the Redis connection settings from the properties file passed as the first argument
ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// make the parameters available to all operators (read back in MyRedisSink)
env.getConfig().setGlobalJobParameters(parameters);
// sample input: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
SingleOutputStreamOperator<ActivityBean> beanStream = lines.map(new MapFunction<String, ActivityBean>() {
@Override
public ActivityBean map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String date = fields[2].split(" ")[0];
String type = fields[3];
String province = fields[4];
return ActivityBean.of(uid, aid, date, type, province);
}
});

// aggregate per (activity, type), per (activity, type, day), and per (activity, type, day, province)
SingleOutputStreamOperator<ActivityBean> res1 = beanStream.keyBy("aid", "type").sum("count");
SingleOutputStreamOperator<ActivityBean> res2 = beanStream.keyBy("aid", "type", "date").sum("count");
SingleOutputStreamOperator<ActivityBean> res3 = beanStream.keyBy("aid", "type", "date", "province").sum("count");

res1.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.ACTIVITY_COUNT +"-"+ value.aid, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

res2.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

res3.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.PROVINCE_DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date + "-" + value.province, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

env.execute();
}
}
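Each sink call runs jedis.hset(key, field, value), so every metric lands in a Redis hash: the key encodes the dimensions, the field is the event type, and the value is the latest running count. For the sample data above, an illustrative redis-cli check (hash field order is not guaranteed) might look like:

127.0.0.1:6379> HGETALL ACTIVITY_COUNT-A1
1) "1"
2) "3"
3) "2"
4) "1"
127.0.0.1:6379> HGET DAILY_ACTIVITY_COUNT-A1-2019-09-02 1
"3"

Here field "1" (click) maps to a running count of 3 for activity A1, and field "2" (participation) maps to 1.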
Constant
package cn._51doit.flink.day08;

public class Constant {

public static final String ACTIVITY_COUNT = "ACTIVITY_COUNT";

public static final String DAILY_ACTIVITY_COUNT = "DAILY_ACTIVITY_COUNT";

public static final String PROVINCE_DAILY_ACTIVITY_COUNT = "PROVINCE_DAILY_ACTIVITY_COUNT";
}
ActivityBean
package cn._51doit.flink.day08;

public class ActivityBean {

public String uid;
public String aid;
public String date;
public String type;
public String province;
// every raw event contributes 1, so sum("count") yields the event count per key
public Long count = 1L;

public ActivityBean() {}

public ActivityBean(String uid, String aid, String date, String type, String province) {
this.uid = uid;
this.aid = aid;
this.date = date;
this.type = type;
this.province = province;
}

public static ActivityBean of(String uid, String aid, String date, String type, String province) {
return new ActivityBean(uid, aid, date, type, province);
}

@Override
public String toString() {
return "ActivityBean{" +
"uid='" + uid + '\'' +
", aid='" + aid + '\'' +
", date='" + date + '\'' +
", type='" + type + '\'' +
", province='" + province + '\'' +
'}';
}
}
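Note that keyBy("aid", "type") and sum("count") rely on Flink's field-expression API, which requires ActivityBean to be a valid POJO: a public class with a public no-argument constructor and public (or getter/setter-accessible) fields. That is why the bean exposes public fields, keeps the empty constructor, and initializes count to 1, so that sum("count") accumulates one per raw event for each key.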
MyRedisSink
package cn._51doit.flink.day08;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class MyRedisSink extends RichSinkFunction<Tuple3<String, String, String>> {

private transient Jedis jedis;

@Override
public void open(Configuration parameters) throws Exception {
ParameterTool params = (ParameterTool) getRuntimeContext()
.getExecutionConfig()
.getGlobalJobParameters();
String host = params.getRequired("redis.host");
String password = params.getRequired("redis.password");
int port = params.getInt("redis.port", 6379);
int db = params.getInt("redis.db", 0);
Jedis jedis = new Jedis(host, port);
jedis.auth(password);
jedis.select(db);
this.jedis = jedis;
}

@Override
public void invoke(Tuple3<String, String, String> value, Context context) throws Exception {
if (!jedis.isConnected()) {
jedis.connect();
}
jedis.hset(value.f0, value.f1, value.f2);
}

@Override
public void close() throws Exception {
jedis.close();
}
}
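Since open() pulls its settings from the global job parameters, ActivityCountWithMultiDimension must be started with the path of a properties file as its first program argument (args[0]). A minimal example with illustrative values follows; only the four keys are the ones actually read above, and redis.port / redis.db fall back to 6379 and 0 when omitted:

# conf.properties (illustrative values)
redis.host=localhost
redis.password=123456
redis.port=6379
redis.db=0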