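1. Case study

user ID,activity ID,time,event type,province
u001,A1,2019-09-02 10:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,2,北京市
u002,A1,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,2,北京市

Event types:
0: impression
1: click
2: participation

Requirement: for each activity, count the number of distinct users and the number of events for clicks and participations.
  • Approach 1: use ValueState together with a HashSet

 The code is as follows: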
ActivityCountAdv1

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class ActivityCountAdv1 {
public static void main(String[] args) throws Exception {
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStreamSource<String> lines = env.socketTextStream("feng05", 8888);
// Split each line and convert it into a tuple
SingleOutputStreamOperator<Tuple5<String, String, String, Integer, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, Integer, String>>() {
@Override
public Tuple5<String, String, String, Integer, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String activityID = fields[1];
String date = fields[2];
Integer type = Integer.parseInt(fields[3]);
String province = fields[4];
return Tuple5.of(uid, activityID, date, type, province);
}
});
// Key by activity ID and event type
KeyedStream<Tuple5<String, String, String, Integer, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, Integer, String>, Tuple4<String, Integer, Integer, Integer>>() {
// HashSet holding the de-duplicated user IDs
private transient ValueState<HashSet<String>> uidState;
// Number of events (not de-duplicated)
private transient ValueState<Integer> countState;

@Override
public void open(Configuration parameters) throws Exception {
// State descriptor for the set of user IDs
ValueStateDescriptor<HashSet<String>> stateDescriptor1 = new ValueStateDescriptor<HashSet<String>>(
"uid-state",
TypeInformation.of(new TypeHint<HashSet<String>>(){})
);
// State descriptor for the event count
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);
// Obtain the state handles
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
}

@Override
public void processElement(Tuple5<String, String, String, Integer, String> value, Context ctx, Collector<Tuple4<String, Integer, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
Integer type = value.f3;
// De-duplicate user IDs with the HashSet and update uidState
HashSet<String> hashSet = uidState.value();
if(hashSet == null){
hashSet = new HashSet<>();
}
hashSet.add(uid);
uidState.update(hashSet);
// Count events (not de-duplicated)
Integer count = countState.value();
if(count == null) {
count = 0;
}
count += 1;
countState.update(count);
out.collect(Tuple4.of(aid,type,hashSet.size(), count));
}
}).print();
env.execute();
}
}
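
The per-key logic of approach 1 can be tried without a Flink cluster. The following is a minimal plain-Java sketch, not the job itself: a map keyed by "activity ID,event type" stands in for Flink's keyed state, and the class name ActivityCountSketch and its Stats holder are hypothetical names introduced only for illustration. The province field is written in ASCII because the sketch does not use it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch: distinct users and total events per (activity, event type). */
final class ActivityCountSketch {

    /** Hypothetical holder for the two values kept per key. */
    static final class Stats {
        final Set<String> users = new HashSet<>(); // plays the role of uidState
        int events = 0;                            // plays the role of countState
    }

    /** Returns one "aid,type,users,events" line per input record, as the streaming job would emit. */
    static List<String> process(List<String> lines) {
        Map<String, Stats> statsByKey = new LinkedHashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String uid = fields[0];
            String aid = fields[1];
            String type = fields[3];
            // Same grouping as keyBy(1, 3): activity ID plus event type
            Stats stats = statsByKey.computeIfAbsent(aid + "," + type, k -> new Stats());
            stats.users.add(uid); // the set ignores duplicates
            stats.events += 1;    // every record counts as one event
            out.add(aid + "," + type + "," + stats.users.size() + "," + stats.events);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "u001,A1,2019-09-02 10:10:11,1,Beijing",
            "u001,A1,2019-09-02 14:10:11,1,Beijing",
            "u001,A1,2019-09-02 14:10:11,2,Beijing",
            "u002,A1,2019-09-02 14:10:11,1,Beijing");
        // The last line printed is A1,1,2,3: two users, three click events
        process(sample).forEach(System.out::println);
    }
}
```

Because a result is emitted for every incoming record, downstream consumers see a stream of running totals rather than a single final value.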

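  If a HashSet is used for de-duplication and the number of users is large, the set consumes a lot of resources, which degrades performance and can even cause an out-of-memory error.

  • Approach 2: an improvement that stores user IDs in a BloomFilter. A BloomFilter can tell for certain that a user has not been seen before (it may occasionally report a false positive, so the distinct count is approximate), and it uses very little memory. A BloomFilter has no counter, however, so an additional state must be defined to store the de-duplicated user count.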

ActivityCountAdv2

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ActivityCountAdv2 {

public static void main(String[] args) throws Exception {

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Sample input: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

// Split each line and convert it into a tuple
SingleOutputStreamOperator<Tuple5<String, String, String, String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, String, String>>() {
@Override
public Tuple5<String, String, String, String, String> map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String time = fields[2];
String type = fields[3];
String province = fields[4];
return Tuple5.of(uid, aid, time, type, province);
}
});

// Key by activity ID and event type
KeyedStream<Tuple5<String, String, String, String, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, String, String>, Tuple4<String, String, Integer, Integer>>() {

// BloomFilter holding the user IDs seen so far
private transient ValueState<BloomFilter> uidState;
// Number of distinct users (de-duplicated)
private transient ValueState<Integer> uidCountState;
// Number of events (not de-duplicated)
private transient ValueState<Integer> countState;

@Override
public void open(Configuration parameters) throws Exception {
// State descriptor for the BloomFilter of user IDs
ValueStateDescriptor<BloomFilter> stateDescriptor1 = new ValueStateDescriptor<BloomFilter>(
"uid-state",
TypeInformation.of(new TypeHint<BloomFilter>(){})
);
// State descriptor for the event count
ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
"count-state",
Integer.class
);
// State descriptor for the distinct user count
ValueStateDescriptor<Integer> stateDescriptor3 = new ValueStateDescriptor<Integer>(
"uid-count-state",
Integer.class
);
// Obtain the state handles
uidState = getRuntimeContext().getState(stateDescriptor1);
countState = getRuntimeContext().getState(stateDescriptor2);
uidCountState = getRuntimeContext().getState(stateDescriptor3);
}

@Override
public void processElement(Tuple5<String, String, String, String, String> value, Context ctx, Collector<Tuple4<String, String, Integer, Integer>> out) throws Exception {
String uid = value.f0;
String aid = value.f1;
String type = value.f3;
// De-duplicate user IDs with the BloomFilter
BloomFilter bloomFilter = uidState.value();
Integer uidCount = uidCountState.value(); // distinct users
Integer count = countState.value(); // events
if(count == null) {
count = 0;
}
if(bloomFilter == null) {
bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
uidCount = 0;
}
if(!bloomFilter.mightContain(uid)) {
bloomFilter.put(uid); // add the user ID to the BloomFilter
uidCount += 1;
}
count += 1;
countState.update(count);
uidState.update(bloomFilter);
uidCountState.update(uidCount);
out.collect(Tuple4.of(aid, type, uidCount, count));
}
}).print();

env.execute();
}
}
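
Approach 2 pairs a Bloom filter with a separate counter because the filter itself cannot report how many items it holds. The following toy sketch shows that pairing in plain Java. It is not Guava's BloomFilter: the class name BloomDistinctCounter, the bit-array size, and the deliberately simple double hashing are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Toy Bloom filter plus a separate distinct counter; a sketch, not Guava's implementation. */
final class BloomDistinctCounter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private int distinct = 0; // a Bloom filter cannot report its size, so count separately

    BloomDistinctCounter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** i-th bit index for a value, derived from two simple base hashes (double hashing). */
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step; intentionally simplistic
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** False means "definitely not seen"; true means "probably seen". */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true; // could be a false positive
    }

    /** Records a value and returns true if it was counted as new. */
    boolean add(String value) {
        if (mightContain(value)) {
            return false;
        }
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
        distinct++;
        return true;
    }

    int distinctCount() {
        return distinct;
    }

    public static void main(String[] args) {
        BloomDistinctCounter counter = new BloomDistinctCounter(1 << 20, 3);
        for (String uid : new String[] {"u001", "u001", "u002", "u001"}) {
            counter.add(uid);
        }
        System.out.println(counter.distinctCount()); // prints 2
    }
}
```

A false positive makes the filter skip a user who is actually new, so a counter built this way can only under-count, never over-count. Sizing the filter generously relative to the expected number of users keeps that error small.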

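 2. Multi-dimensional activity metrics

  Several keyBy operations are needed here (one per combination of dimensions), which is rather tedious. The results are written to Redis, so the job does not need to manage Flink state by hand (the built-in sum aggregation keeps the running totals); see the code for details.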

ActivityCountWithMultiDimension

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ActivityCountWithMultiDimension {

public static void main(String[] args) throws Exception {

// Load job parameters (including the Redis settings) from a properties file
ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

// Sample input: u001,A1,2019-09-02 10:10:11,1,北京市
DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

SingleOutputStreamOperator<ActivityBean> beanStream = lines.map(new MapFunction<String, ActivityBean>() {
@Override
public ActivityBean map(String line) throws Exception {
String[] fields = line.split(",");
String uid = fields[0];
String aid = fields[1];
String date = fields[2].split(" ")[0];
String type = fields[3];
String province = fields[4];
return ActivityBean.of(uid, aid, date, type, province);
}
});

// One keyBy per combination of dimensions; sum("count") keeps a running total per key
SingleOutputStreamOperator<ActivityBean> res1 = beanStream.keyBy("aid", "type").sum("count");
SingleOutputStreamOperator<ActivityBean> res2 = beanStream.keyBy("aid", "type", "date").sum("count");
SingleOutputStreamOperator<ActivityBean> res3 = beanStream.keyBy("aid", "type", "date", "province").sum("count");

// Redis key = prefix + dimensions, hash field = event type, value = count
res1.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.ACTIVITY_COUNT +"-"+ value.aid, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

res2.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

res3.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
@Override
public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
return Tuple3.of(Constant.PROVINCE_DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date + "-" + value.province, value.type, value.count.toString());
}
}).addSink(new MyRedisSink());

env.execute();
}
}
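
To make the resulting Redis layout concrete, the sketch below simulates it with nested maps instead of a Redis server: the outer key corresponds to the Redis key, and the inner map to the hash fields (event type to count). The class name RedisLayoutSketch is hypothetical, the key prefixes are written out as literals, and incrementing by one stands in for the running sum followed by a hash write.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simulates the Redis hashes produced by the three aggregations, using nested maps. */
final class RedisLayoutSketch {
    // Outer key = Redis key, inner map = hash fields (event type -> count)
    final Map<String, Map<String, Long>> store = new LinkedHashMap<>();

    /** Stand-in for "keep a running sum, then write it to hash `key` under field `type`". */
    private void bump(String key, String type) {
        store.computeIfAbsent(key, k -> new LinkedHashMap<>()).merge(type, 1L, Long::sum);
    }

    /** Feeds one raw record into all three dimension combinations. */
    void accept(String line) {
        String[] fields = line.split(",");
        String aid = fields[1];
        String date = fields[2].split(" ")[0]; // keep only the day part
        String type = fields[3];
        String province = fields[4];
        bump("ACTIVITY_COUNT-" + aid, type);
        bump("DAILY_ACTIVITY_COUNT-" + aid + "-" + date, type);
        bump("PROVINCE_DAILY_ACTIVITY_COUNT-" + aid + "-" + date + "-" + province, type);
    }

    public static void main(String[] args) {
        RedisLayoutSketch sketch = new RedisLayoutSketch();
        sketch.accept("u001,A1,2019-09-02 10:10:11,1,Beijing");
        sketch.accept("u002,A1,2019-09-02 14:10:11,1,Beijing");
        sketch.accept("u001,A1,2019-09-02 14:10:11,2,Beijing");
        // Each entry is one simulated Redis hash
        sketch.store.forEach((key, hash) -> System.out.println(key + " -> " + hash));
    }
}
```

Putting the dimensions into the key and the event type into the hash field means one hash read returns all event-type counts for a given activity, day, or province.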

Constant

package cn._51doit.flink.day08;

public class Constant {

    public static final String ACTIVITY_COUNT = "ACTIVITY_COUNT";

    public static final String DAILY_ACTIVITY_COUNT = "DAILY_ACTIVITY_COUNT";

    public static final String PROVINCE_DAILY_ACTIVITY_COUNT = "PROVINCE_DAILY_ACTIVITY_COUNT";
}

ActivityBean

package cn._51doit.flink.day08;

public class ActivityBean {

    public String uid;

    public String aid;

    public String date;

    public String type;

    public String province;

    public Long count = 1L;

    public ActivityBean() {}

    public ActivityBean(String uid, String aid, String date, String type, String province) {
        this.uid = uid;
        this.aid = aid;
        this.date = date;
        this.type = type;
        this.province = province;
    }

    public static ActivityBean of(String uid, String aid, String date, String type, String province) {
        return new ActivityBean(uid, aid, date, type, province);
    }

    @Override
    public String toString() {
        return "ActivityBean{" +
                "uid='" + uid + '\'' +
                ", aid='" + aid + '\'' +
                ", date='" + date + '\'' +
                ", type='" + type + '\'' +
                ", province='" + province + '\'' +
                '}';
    }
}

MyRedisSink

package cn._51doit.flink.day08;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class MyRedisSink extends RichSinkFunction<Tuple3<String, String, String>> {

private transient Jedis jedis;

@Override
public void open(Configuration parameters) throws Exception {
ParameterTool params = (ParameterTool) getRuntimeContext()
.getExecutionConfig()
.getGlobalJobParameters();
String host = params.getRequired("redis.host");
String password = params.getRequired("redis.password");
int port = params.getInt("redis.port", 6379);
int db = params.getInt("redis.db", 0);
Jedis jedis = new Jedis(host, port);
jedis.auth(password);
jedis.select(db);
this.jedis = jedis;
}

@Override
public void invoke(Tuple3<String, String, String> value, Context context) throws Exception {
if (!jedis.isConnected()) {
jedis.connect();
}
jedis.hset(value.f0, value.f1, value.f2);
}

@Override
public void close() throws Exception {
// jedis is still null if open() failed before the connection was created
if (jedis != null) {
jedis.close();
}
}
}
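
The sink reads four settings from the global job parameters: redis.host and redis.password are required, while redis.port and redis.db fall back to 6379 and 0. The sketch below mirrors that lookup with java.util.Properties so it runs without Flink. The class name RedisConfigSketch and the sample values (localhost, changeme, database 2) are placeholders, not values from the original project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Mirrors the sink's parameter lookup: two required keys, two keys with defaults. */
final class RedisConfigSketch {
    final String host;
    final String password;
    final int port;
    final int db;

    private RedisConfigSketch(String host, String password, int port, int db) {
        this.host = host;
        this.password = password;
        this.port = port;
        this.db = db;
    }

    /** Parses properties-file text such as "redis.host=localhost". */
    static RedisConfigSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RedisConfigSketch(
            required(props, "redis.host"),
            required(props, "redis.password"),
            Integer.parseInt(props.getProperty("redis.port", "6379")),
            Integer.parseInt(props.getProperty("redis.db", "0")));
    }

    /** Fails fast when a mandatory key is missing. */
    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing required property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // Placeholder values; redis.port is omitted on purpose to show the default
        String text = String.join("\n",
            "redis.host=localhost",
            "redis.password=changeme",
            "redis.db=2");
        RedisConfigSketch config = parse(text);
        System.out.println(config.host + ":" + config.port + " db=" + config.db); // prints localhost:6379 db=2
    }
}
```

In the actual job the same keys would live in the properties file whose path is passed as the first program argument.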
