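Validating ETL data with shell

The approach is as follows:

  Sort the checks into types; each type needs its own configuration.

    1. Incremental record check: set the table name and the increment field.

    2. Illegal-value check: set the table name, condition, field to check, and the range of legal/illegal values.

    3. Custom check: set the table name, check name, and custom SQL.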
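As a rough sketch of how one line of such a configuration can be split into its parts, the snippet below uses the shell's `read` builtin. The helper name `parse_line` and the variable `rest` are made up for illustration; the sample lines follow the format of the configuration file in this post.

```shell
#!/bin/bash
# Split one config line into its fields. The first two fields are always
# the check type and the table name; whatever follows is either the
# partition field (record checks) or the SQL text (custom checks).
parse_line() {
    # `read -r` keeps backslashes literal; the last variable receives the
    # whole remainder of the line, embedded spaces included.
    read -r check_type table_name rest <<EOF
$1
EOF
}

parse_line " record dm_index index_date"
echo "$check_type | $table_name | $rest"

parse_line " primary_check dm_event_summary select count(1) from dm_event_summary"
echo "$check_type | $table_name | $rest"
```

Letting the last variable of `read` soak up the remainder avoids a separate `cut` call per field and keeps the SQL text in one piece.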

  Parameter parsing:

    Wrap each parameter in a special-character prefix and suffix so that the script can detect and replace it easily.
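A minimal sketch of that substitution, assuming the `_##` prefix and `##_` suffix used later in this post. The helper name `render_sql` is hypothetical; the replacement itself is plain bash parameter expansion.

```shell
#!/bin/bash
args_prefix="_##"
args_suffix="##_"

# Replace every _##name##_ placeholder in a template with a value.
# render_sql is a hypothetical helper name used only for this sketch.
render_sql() {
    local template=$1 name=$2 value=$3
    local placeholder="${args_prefix}${name}${args_suffix}"
    # Quoting the pattern makes bash match it literally, so '#' and '_'
    # are not treated as pattern characters.
    printf '%s\n' "${template//"$placeholder"/$value}"
}

render_sql "select movie_id from dm_index where index_date='_##dt##_'" dt 2019-01-01
```

The unusual delimiters matter: a plain `dt` would also match column names such as `index_date`, while `_##dt##_` is unlikely to occur in real SQL.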

  The resulting scripts are as follows:

  Configuration file:

dm_monitor_list.conf 　　

 record dm_box_office_summary index_date

 record dm_channel_index index_date

 record dm_comment_emotion_summary

 record dm_comment_keyword_summary

 record dm_comment_meterial dt

 record dm_event_meterial index_date

 record dm_event_meterial_comment dt

 record dm_event_summary

 record dm_index index_date

 record dm_main_actor_index index_date

 record dm_movie_comment_summary index_date

 record dm_movie_wish_rating_summary dt

 record dm_voice_meterial dt

 record dm_index_date

 record dm_comment_keyword_base

 record dm_index_base

 record dm_event_meterial_base

 primary_check dm_box_office_summary select concat(movie_id,":",rating_type,":",count(1)) as row_val from dm_box_office_summary where index_date='_##dt##_' group by movie_id,rating_type having count(1) >1

 primary_check dm_channel_index select concat(movie_id,":",count(1)) as row_val from dm_channel_index where datediff('_##dt##_',index_date)=1 group by movie_id having count(1)>1

 primary_check dm_box_office_summary select concat(movie_id,":",index_date,":",value) as row_val from dm_box_office_summary where index_date='_##dt##_' and value<=0

 primary_check dm_comment_emotion_summary select concat(movie_id,":",mood_type,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_comment_emotion_summary group by movie_id,mood_type,platform_id,channel_id,index_date having count(1)>1

 primary_check dm_comment_keyword_summary select concat(movie_id,":",mood_type,":",keyword,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_comment_keyword_summary group by movie_id,mood_type,keyword,platform_id,channel_id,index_date having count(1)>1

 primary_check dm_comment_meterial select concat(comment_id,":",count(1)) as row_val from dm_comment_meterial where dt="_##dt##_" group by comment_id having count(1)>1

 primary_check dm_event_meterial select concat(material_url,":",count(1)) as row_val from dm_event_meterial where index_date='_##dt##_' and index_type=1 group by material_url having count(1)>1

 primary_check dm_event_meterial_comment select concat(comment_id,":",count(1)) as row_val from dm_event_meterial_comment where dt='_##dt##_' group by comment_id having count(1)>1

 primary_check dm_event_summary select concat(event_id,":",platform_id,":",channel_id,":",index_date,":",count(1)) as row_val from dm_event_summary group by event_id,platform_id,channel_id,index_date having count(1)>1
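The script below treats every check type other than `record` as a custom SQL check, so a misspelled type is not rejected; it is simply run and logged under the misspelled name. One way to catch this early is a small pre-flight check such as the sketch below. The function `lint_config` and the list in `known_types` are assumptions made for this example, not part of the original script.

```shell
#!/bin/bash
# Report config lines whose check type is not in a known list.
# The list of known types is an assumption for this sketch.
known_types="record primary_check"

lint_config() {
    local line_no=0 check_type table_name rest status=0
    while read -r check_type table_name rest; do
        line_no=$((line_no + 1))
        # Skip blank lines.
        [ -z "$check_type" ] && continue
        case " $known_types " in
            *" $check_type "*) ;;
            *) echo "line $line_no: unknown check type '$check_type'"; status=1 ;;
        esac
        if [ -z "$table_name" ]; then
            echo "line $line_no: missing table name"
            status=1
        fi
    done
    return $status
}

# Feed two sample lines through the check; the second type is misspelled.
printf '%s\n' " record dm_index index_date" " recrod dm_index index_date" \
    | lint_config || echo "config has errors"
```

In practice the function would read the real file, for example `lint_config < dm_monitor_list.conf`, before any query is run.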

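  Script file: monitor.sh

 #!/bin/bash

 # Analyze the state of table data volumes

 # - data uniqueness

 #   movie ids are unique

 # - metric correctness

 #   increments must not be below 0; full tables must not shrink

 # - basic status

 ##  number of records processed, number of movies, space used

 # Log format:

 ## tablename    dt  check_type  value   insert_date

 ## check_type :

 ### record: record count

 ### movie_num: number of movies

 ### space: space used

 ### diff: difference between yesterday's and today's movies; 01 means present today but not yesterday, 10 means present yesterday but not today

 ### movie_rep: number of duplicated movies

 ### index-*: a metric whose increment is negative

 basepath=$(cd "$(dirname "$0")"; pwd)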

 cd $basepath

 source /etc/profile

 source ../../etc/env.ini

 if [[ ! -f "$basepath/monitor_list.conf" ]]; then

     echo "check monitor list file not exists. system exit."

     exit

 fi

 #config

 # partition date: defaults to yesterday, or the first argument if given

 dt=$(date -d "-1 day" "+%Y-%m-%d")

 if [[ $# -eq 1 ]]; then

     dt=$(date -d "$1" "+%Y-%m-%d")

 fi
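The date handling above can also be wrapped in a function that fails loudly on a bad argument. This is only a sketch: `resolve_dt` is a hypothetical name, and like the original it relies on GNU `date -d`.

```shell
#!/bin/bash
# Resolve the partition date: yesterday by default, or the date given as
# the first argument, normalised to YYYY-MM-DD. Requires GNU date (-d).
resolve_dt() {
    if [ $# -ge 1 ] && [ -n "$1" ]; then
        # date exits non-zero when it cannot parse the argument.
        date -d "$1" "+%Y-%m-%d"
    else
        date -d "-1 day" "+%Y-%m-%d"
    fi
}

dt=$(resolve_dt "20190305") || { echo "invalid date" >&2; exit 1; }
echo "$dt"
```

Checking the exit status matters because an unparseable argument would otherwise leave `dt` empty and every later query would silently filter on an empty partition value.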

 insert_date=$(date "+%Y-%m-%d")

 file_path=$OPERATION_LOG_FILE_PATH

 log_name=monitor_data.log

 log=${file_path}/${log_name}

 cat $basepath/monitor_list.conf | while read line

 do

     check_type=$(echo $line | cut -d " " -f 1)

     table_name=$(echo $line | cut -d " " -f 2)

     profix=$table_name"\t"$insert_date"\t"

     if [[ $check_type == 'dw' ]];then

         DB=$HIVE_DB_DW

         hdfs_path=$HADOOP_DW_DATA_DESC

     elif [[ $check_type == 'ods' ]];then

         DB=$HIVE_DB_ODS_S

         hdfs_path=$HADOOP_ODS_S_DATA_DESC

     fi

     #record

     record=$(spark-sql -e "select count(1) from $DB.$table_name where dt = '$dt';")

     echo -e $profix"record\t"$record >> $log

     #movie_num

     if [[ $table_name == 'dw_weibo_materials' ]];then

         mtime_id="movie_id"

     elif [[ $table_name == 'g4_weibo_materiel_post' ]];then

         mtime_id='x_movie_id'

     else

         mtime_id="mtime_id"

     fi

     if [[ $table_name == 'dw_weibo_user' ]];then

         movie_num=$(hive -e "select count(1) from

             (select mtime_actor_id from $DB.$table_name where dt = '$dt' and source = 'govwb' group by mtime_actor_id) a")

     else

         movie_num=$(spark-sql -e "select count(1) from (select $mtime_id from $DB.$table_name where dt = '$dt' group by $mtime_id) a")

     fi

     echo -e $profix"movie_num\t"$movie_num >> $log

     #space

     if [[ $check_type == 'ods' ]];then

         space=$(hadoop fs -du $hdfs_path/$table_name/$dt)

     else

         space=$(hadoop fs -du $hdfs_path/$table_name/dt=$dt)

     fi

     echo -e $profix"space\t"$space>> $log

     #diff

     if [[ $table_name != 'dw_weibo_user' ]];then

         yesterday=$(date -d "-1 day $dt" "+%Y-%m-%d")

         diff=$(spark-sql -e "

             select concat_ws('|',collect_set(flag)) from (

                 select 'gf' as gf, concat_ws('=',flag,cast(count(1) as string)) as flag  from (

                     select concat(if(y.$mtime_id is null, 0, 1),if(t.$mtime_id is null, 0, 1)) as flag

                     from (select distinct $mtime_id from $DB.$table_name where dt='$dt') t

                     full outer join (select distinct $mtime_id from $DB.$table_name where dt='$yesterday') y

                     on  t.$mtime_id = y.$mtime_id

                 ) a group by flag

             ) b group by gf;")

         echo -e $profix"diff\t"$diff>> $log

     fi

     #movie_rep

     if [[ $check_type == 'dw' ]];then

         movie_rep=$(spark-sql -e "

             select concat_ws('|',collect_set(v)) from (

                 select 'k' as k ,concat_ws('=',id,cast(count(1) as string)) as v

                 from $DB.$table_name where dt = '$dt'

                 group by id

                 having count(1) > 1

             )a  group by k;")

         echo -e $profix"movie_rep\t"$movie_rep>> $log

     fi

     #index-*

     if [[ $table_name == 'dw_comment_statistics' ]];then

         up_day=$(spark-sql -e "select concat('<0:',count(1)) from $DB.$table_name

             where dt = '$dt' and

                 (cast(up_day as int) < 0

                 or cast(down_day as int) < 0

                 or cast(vv_day as int ) < 0

                 or cast(cmts_day as int) < 0

                 );")

         echo -e $profix"index_day\t"$up_day >> $log

     fi

 done
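Every check in the loop above appends a tab-separated record built with `echo -e` and string concatenation. A small helper built on `printf` is one possible alternative; `log_metric` is a hypothetical name, and the field order simply mirrors the `echo` calls above.

```shell
#!/bin/bash
# Append one tab-separated record: table, insert date, check type, value.
# log_metric is a hypothetical helper; the field order mirrors the echo
# calls in the loop above.
log_metric() {
    local log_file=$1 table_name=$2 insert_date=$3 check_type=$4 value=$5
    # printf '%s' leaves backslashes and wildcards in the value untouched,
    # unlike an unquoted `echo -e`. Newlines are flattened to spaces so
    # that each record stays on one line.
    value=$(printf '%s' "$value" | tr '\n' ' ')
    printf '%s\t%s\t%s\t%s\n' "$table_name" "$insert_date" "$check_type" "$value" >> "$log_file"
}

log=$(mktemp)
log_metric "$log" dw_comment_statistics 2019-03-06 record 1024
cat "$log"
rm -f "$log"
```

Keeping each record on a single line matters here because the log file is later loaded into a Hive partition, where a stray newline would become a broken row.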

 #dm

 args_prefix="_##"

 args_suffix="##_"

 cat $basepath/dm_monitor_list.conf | while read line

 do

     check_type=$(echo $line | cut -d " " -f 1)

     table_name=$(echo $line | cut -d " " -f 2)

     echo "table: "$table_name

     if [[ $check_type == 'record' ]]; then

         dt_str=$(echo $line | cut -d " " -f 3)

         echo "record count check, partition field: "$dt_str

     else

         custom_sql=$(echo $line | cut -d " " -f 3-)

         echo "custom check: "$check_type

     fi

     fi

     profix=$table_name"\t"$insert_date"\t"

     DB=$HIVE_DB_DW

     hdfs_path=$HADOOP_DW_DATA_DESC

     if [[ $check_type == 'record' ]]; then

         record_sql="select count(1) from $DB.$table_name"

         if [[ -n $dt_str ]]; then

             # if [[ $table_name == 'dm_channel_index' ]]; then

                 #     record_sql=$record_sql" where datediff('$dt',$dt_str)=1;"

             # else

             #     record_sql=$record_sql" where $dt_str = '$dt';"

             # fi

             record_sql=$record_sql" where $dt_str = '$dt';"

         else

             record_sql=$record_sql";"

         fi

         echo "statement to run: "$record_sql

         #record

         record=$(hive -e "set hive.mapred.mode = nonstrict;$record_sql")

         #record=$(spark-sql -e "$record_sql")

         echo -e $profix"$check_type\t"$record >> $log

     else

         #custom_sql

         custom_sql=${custom_sql//$args_prefix"dt"$args_suffix/$dt}

         echo "statement to run: "$custom_sql

         invalid_records=$(hive -e "set hive.mapred.mode = nonstrict;use $DB;select concat_ws(\" | \",collect_set(row_val)) from ( $custom_sql ) tmp;")

         echo $invalid_records

         if [[ ! -n $invalid_records || $invalid_records == ''  ]]; then

                 invalid_records=""

         fi

         echo -e $profix"$check_type\t"$invalid_records >> $log

     fi

 done
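The record-count branch of the loop above assembles its SQL by appending to a string. The same logic can be isolated in a function, which makes it easy to inspect the generated statement without calling Hive. This is a sketch only; `build_record_sql` is a hypothetical name and the database name `dw` in the calls is a placeholder.

```shell
#!/bin/bash
# Build the record-count query for a table. When a partition field is
# given the count is restricted to one day; otherwise the whole table is
# counted. build_record_sql is a hypothetical helper name.
build_record_sql() {
    local db=$1 table_name=$2 dt=$3 dt_field=${4:-}
    local sql="select count(1) from $db.$table_name"
    if [ -n "$dt_field" ]; then
        sql="$sql where $dt_field = '$dt'"
    fi
    printf '%s;\n' "$sql"
}

# 'dw' stands in for the value of $HIVE_DB_DW.
build_record_sql dw dm_index 2019-03-05 index_date
build_record_sql dw dm_event_summary 2019-03-05
```

The first call prints a query filtered on `index_date`; the second, with no partition field, counts the whole table, which matches the config lines that list only a table name.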

 # insert hive

 hadoop fs -rm -r $HADOOP_ODS_CONFIG_DATA_DESC/yq_monitor_data_log/dt=$dt

 if [ -f "${file_path}/$log_name" ]; then

     hive -e "

     ALTER TABLE $HIVE_DB_MONITOR.yq_monitor_data_log DROP IF EXISTS PARTITION (dt = '$dt');

     alter table $HIVE_DB_MONITOR.yq_monitor_data_log add partition (dt = '$dt');

     "

 fi

 cd $file_path

 hadoop fs -put $log_name $HADOOP_ODS_CONFIG_DATA_DESC/yq_monitor_data_log/dt=$dt

 mv -f $log_name /home/trash
