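Replacing a Traditional Data Warehouse with HAWQ in Practice (11): Dimension Table Techniques, Merging Dimensions
One case that calls for merging dimensions arises when attributes that are really the same have, for whatever reason, been designed as duplicate dimension attributes. In the sales order example, as the number of dimensions in the data warehouse grows, some common data turns out to exist in more than one dimension. The customer dimension, for instance, carries zip code, city, and state both in its customer address information and in its shipping address information. This article shows how to merge the two sets of zip code information in the customer dimension into one new dimension.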
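I. Modifying the Data Warehouse Table Structure
Merging the dimensions requires changing the data warehouse table structure. Figure 1 shows the structure after the change: a new zip code dimension table, zip_code_dim, has been added, and the structure of the sales_order_fact fact table has been modified accordingly.
The zip_code_dim dimension table is related directly to the sales order fact table. For zip code information, this relationship replaces the one that previously ran through the customer dimension. The sales_order_fact table needs two such relationships, one to the customer address zip code and one to the shipping address zip code, so two foreign key columns are added. Zip code information is assumed never to change, so zip_code_dim has none of the SCD attributes such as a deleted flag, version number, or effective date.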
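The target structure can be tried out on a small scale. The following is a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in for HAWQ; the sample rows and the reduced column list of sales_order_fact are hypothetical and only illustrate how one physical zip_code_dim table serves two roles through two views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table zip_code_dim (
    zip_code_sk integer primary key,   -- surrogate key, assigned 1, 2, ...
    zip_code    integer,
    city        text,
    state       text
);
-- Hypothetical sample rows, for illustration only.
insert into zip_code_dim (zip_code, city, state) values
    (17050, 'pittsburgh', 'pa'),
    (17055, 'mechanicsburg', 'pa');

-- One physical table, two role-playing views with role-specific column names.
create view v_customer_zip_code_dim as
select zip_code_sk as customer_zip_code_sk, zip_code as customer_zip_code,
       city as customer_city, state as customer_state
from zip_code_dim;

create view v_shipping_zip_code_dim as
select zip_code_sk as shipping_zip_code_sk, zip_code as shipping_zip_code,
       city as shipping_city, state as shipping_state
from zip_code_dim;

-- Reduced fact table: only the columns needed for the illustration.
create table sales_order_fact (
    order_number         integer,
    customer_zip_code_sk integer,
    shipping_zip_code_sk integer
);
insert into sales_order_fact values (1, 1, 2), (2, 2, 2);
""")

# Each foreign key resolves through its own view, so both roles can be
# selected side by side without column name clashes.
rows = conn.execute("""
select f.order_number, c.customer_zip_code, s.shipping_zip_code
from sales_order_fact f
join v_customer_zip_code_dim c on f.customer_zip_code_sk = c.customer_zip_code_sk
join v_shipping_zip_code_dim s on f.shipping_zip_code_sk = s.shipping_zip_code_sk
order by f.order_number
""").fetchall()
print(rows)
```

Giving each view its own column names is what lets a single query return the customer zip code and the shipping zip code together.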
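The script below modifies the data warehouse schema. It makes the following changes.
- Creates the zip code dimension table zip_code_dim.
- Performs the initial load of zip code data.
- Creates the v_customer_zip_code_dim and v_shipping_zip_code_dim views on top of zip_code_dim.
- Adds the customer_zip_code_sk and shipping_zip_code_sk columns to sales_order_fact.
- Populates the two zip code surrogate keys from the existing customer and shipping zip codes.
- Drops the customer and shipping zip code columns, together with their city and state columns, from customer_dim.
- Drops the same customer and shipping zip code, city, and state columns from pa_customer_dim.
- Rebuilds the views that depend on the two customer dimension tables.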
set search_path=tds;

-- Create the zip code dimension table
create table zip_code_dim (
    zip_code_sk serial,
    zip_code int,
    city varchar(30),
    state varchar(2)
);

comment on table zip_code_dim is 'zip code dimension table';
comment on column zip_code_dim.zip_code_sk is 'zip code dimension surrogate key';
comment on column zip_code_dim.zip_code is 'zip code';
comment on column zip_code_dim.city is 'city';
comment on column zip_code_dim.state is 'state';

-- Initial load of zip code data
insert into zip_code_dim (zip_code, city, state)
select distinct *
  from (select customer_zip_code, customer_city, customer_state
          from customer_dim
         where customer_zip_code is not null
         union all
        select shipping_zip_code, shipping_city, shipping_state
          from customer_dim
         where shipping_zip_code is not null) t1;

-- Create the role-playing views
create view v_customer_zip_code_dim
    (customer_zip_code_sk, customer_zip_code, customer_city, customer_state) as
select * from zip_code_dim;

create view v_shipping_zip_code_dim
    (shipping_zip_code_sk, shipping_zip_code, shipping_city, shipping_state) as
select * from zip_code_dim;

-- Add the zip code surrogate key columns
alter table sales_order_fact add column customer_zip_code_sk int default null;
alter table sales_order_fact add column shipping_zip_code_sk int default null;

comment on column sales_order_fact.customer_zip_code_sk is 'customer zip code surrogate key';
comment on column sales_order_fact.shipping_zip_code_sk is 'shipping zip code surrogate key';

-- Initial load of the two zip code surrogate keys
create table sales_order_fact_bak as select * from sales_order_fact;
truncate table sales_order_fact;

insert into sales_order_fact
select t1.order_number,
       t1.customer_sk,
       t1.product_sk,
       t1.order_date_sk,
       t1.year_month,
       t1.order_amount,
       t1.order_quantity,
       t1.request_delivery_date_sk,
       t1.sales_order_attribute_sk,
       t2.customer_zip_code_sk,
       t3.shipping_zip_code_sk
  from sales_order_fact_bak t1
  left join (select a.order_number order_number,
                    c.customer_zip_code_sk customer_zip_code_sk
               from sales_order_fact_bak a,
                    customer_dim b,
                    v_customer_zip_code_dim c
              where a.customer_sk = b.customer_sk
                and b.customer_zip_code = c.customer_zip_code) t2
    on t1.order_number = t2.order_number
  left join (select a.order_number order_number,
                    c.shipping_zip_code_sk shipping_zip_code_sk
               from sales_order_fact_bak a,
                    customer_dim b,
                    v_shipping_zip_code_dim c
              where a.customer_sk = b.customer_sk
                and b.shipping_zip_code = c.shipping_zip_code) t3
    on t1.order_number = t3.order_number;

drop table sales_order_fact_bak;

-- Drop the customer and shipping zip code, city, and state columns from customer_dim
alter table customer_dim drop column customer_zip_code cascade;
alter table customer_dim drop column customer_city;
alter table customer_dim drop column customer_state;
alter table customer_dim drop column shipping_zip_code;
alter table customer_dim drop column shipping_city;
alter table customer_dim drop column shipping_state;

-- Drop the same columns from pa_customer_dim
alter table pa_customer_dim drop column customer_zip_code;
alter table pa_customer_dim drop column customer_city;
alter table pa_customer_dim drop column customer_state;
alter table pa_customer_dim drop column shipping_zip_code;
alter table pa_customer_dim drop column shipping_city;
alter table pa_customer_dim drop column shipping_state;

-- Rebuild the dependent views
create or replace view v_customer_dim_latest as
select customer_sk,
       customer_number,
       customer_name,
       customer_street_address,
       version,
       effective_date,
       shipping_address
  from (select distinct on (customer_number)
               customer_number,
               customer_sk,
               customer_name,
               customer_street_address,
               isdelete,
               version,
               effective_date,
               shipping_address
          from customer_dim
         order by customer_number, customer_sk desc) as latest
 where isdelete is false;

create or replace view v_customer_dim_his as
select *,
       date(lead(effective_date,1,date '2200-01-01')
            over (partition by customer_number order by effective_date)) expiry_date
  from customer_dim;

create or replace view v_pa_customer_dim_latest as
select customer_sk,
       customer_number,
       customer_name,
       customer_street_address,
       version,
       effective_date,
       shipping_address
  from (select distinct on (customer_number)
               customer_number,
               customer_sk,
               customer_name,
               customer_street_address,
               isdelete,
               version,
               effective_date,
               shipping_address
          from pa_customer_dim
         order by customer_number, customer_sk desc) as latest
 where isdelete is false;

create or replace view v_pa_customer_dim_his as
select *,
       date(lead(effective_date,1,date '2200-01-01')
            over (partition by customer_number order by effective_date)) expiry_date
  from pa_customer_dim;
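Notes:
- The initial data for the zip code dimension comes from the customer dimension table, which serves only to demonstrate the load process. Customers' zip codes very likely do not cover every zip code, so a better approach is to load a complete zip code reference table. Because customer addresses and shipping addresses can overlap, distinct is used to remove duplicates. The three shipping address columns were added later, so rows that predate them have empty shipping addresses; since the zip code dimension must not contain NULL values, the filter where shipping_zip_code is not null removes rows whose zip code information is NULL.
- The customer zip code and shipping zip code views are created on the zip code dimension table and act as role-playing dimensions for the two kinds of geographic information.
- The data in the backup table sales_order_fact_bak is loaded back into the sales order fact table. While doing so, the load joins to the two zip code role-playing views to look up the two surrogate keys and writes them to the fact table. Note that the old fact table is connected to the new zip code dimension only through the customer dimension table, so each subquery needs a three-table join; two left outer joins then keep every row of the original fact table while loading it into the new fact table with the zip code surrogate key columns.
- Dropping columns from customer_dim requires the cascade clause so that the views depending on them are dropped as well; the views are then rebuilt.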
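The initial load and the surrogate key back-fill described in the notes above can be sketched as follows. This is a minimal illustration, assuming SQLite (Python's built-in sqlite3 module) in place of HAWQ; the tables are reduced to the columns involved and every sample row is hypothetical. Customer 1 deliberately has no shipping address, to show why the left outer joins matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer_dim (
    customer_sk integer primary key,
    customer_zip_code integer, customer_city text, customer_state text,
    shipping_zip_code integer, shipping_city text, shipping_state text
);
-- Hypothetical rows: customer 1 predates the shipping address columns.
insert into customer_dim values
    (1, 17050, 'pittsburgh', 'pa', null, null, null),
    (2, 17055, 'mechanicsburg', 'pa', 17050, 'pittsburgh', 'pa');

create table zip_code_dim (
    zip_code_sk integer primary key, zip_code integer, city text, state text
);
-- Initial load: union both address roles, skip NULLs, remove duplicates.
insert into zip_code_dim (zip_code, city, state)
select distinct zip_code, city, state
  from (select customer_zip_code as zip_code, customer_city as city,
               customer_state as state
          from customer_dim where customer_zip_code is not null
         union all
        select shipping_zip_code, shipping_city, shipping_state
          from customer_dim where shipping_zip_code is not null) t
 order by zip_code;

create table sales_order_fact_bak (order_number integer, customer_sk integer);
insert into sales_order_fact_bak values (101, 1), (102, 2);
""")

zip_codes = [r[0] for r in conn.execute(
    "select zip_code from zip_code_dim order by zip_code")]

# Back-fill: each subquery walks fact -> customer_dim -> zip_code_dim;
# the outer left joins keep orders whose lookup finds nothing.
backfilled = conn.execute("""
select t1.order_number, t2.customer_zip_code_sk, t3.shipping_zip_code_sk
  from sales_order_fact_bak t1
  left join (select a.order_number, c.zip_code_sk as customer_zip_code_sk
               from sales_order_fact_bak a, customer_dim b, zip_code_dim c
              where a.customer_sk = b.customer_sk
                and b.customer_zip_code = c.zip_code) t2
    on t1.order_number = t2.order_number
  left join (select a.order_number, c.zip_code_sk as shipping_zip_code_sk
               from sales_order_fact_bak a, customer_dim b, zip_code_dim c
              where a.customer_sk = b.customer_sk
                and b.shipping_zip_code = c.zip_code) t3
    on t1.order_number = t3.order_number
 order by t1.order_number
""").fetchall()
print(zip_codes, backfilled)
```

In the sketch, zip code 17050 occurs in both address roles but is stored once, and order 101 keeps its row with a NULL shipping surrogate key instead of being dropped.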
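II. Modifying the Regular Load Function
The regular load function changes in three places:
- All zip-code-related columns are removed from the customer dimension load, because the customer dimension no longer holds customer or shipping zip code information.
- The fact table load references the surrogate keys in the customer zip code view and the shipping zip code view.
- The pa_customer_dim load is modified, because the customer zip code now has to be obtained through customer_zip_code_sk in the sales order fact table.
The modified fn_regular_load function follows.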
create or replace function fn_regular_load ()
returns void as
$$
declare
    -- Effective dates used for SCD processing
    v_cur_date date := current_date;
    v_pre_date date := current_date - 1;
    v_last_load date;
begin
    -- Analyze the external tables
    analyze ext.customer;
    analyze ext.product;
    analyze ext.sales_order;

    -- Load the external table data into the raw data tables
    truncate table rds.customer;
    truncate table rds.product;

    insert into rds.customer select * from ext.customer;
    insert into rds.product select * from ext.product;
    insert into rds.sales_order
    select order_number,
           customer_number,
           product_code,
           order_date,
           entry_date,
           order_amount,
           order_quantity,
           request_delivery_date,
           verification_ind,
           credit_check_flag,
           new_customer_ind,
           web_order_flag
      from ext.sales_order;

    -- Analyze the tables in the rds schema
    analyze rds.customer;
    analyze rds.product;
    analyze rds.sales_order;

    -- Set the upper bound of the CDC window
    select last_load into v_last_load from rds.cdc_time;
    truncate table rds.cdc_time;
    insert into rds.cdc_time select v_last_load, v_cur_date;

    -- Load the customer dimension
    insert into tds.customer_dim
    (customer_number,
     customer_name,
     customer_street_address,
     shipping_address,
     isdelete,
     version,
     effective_date)
    select case flag when 'D' then a_customer_number else b_customer_number end customer_number,
           case flag when 'D' then a_customer_name else b_customer_name end customer_name,
           case flag when 'D' then a_customer_street_address else b_customer_street_address end customer_street_address,
           case flag when 'D' then a_shipping_address else b_shipping_address end shipping_address,
           case flag when 'D' then true else false end isdelete,
           case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v,
           v_pre_date
      from (select a.customer_number a_customer_number,
                   a.customer_name a_customer_name,
                   a.customer_street_address a_customer_street_address,
                   a.shipping_address a_shipping_address,
                   a.version a_version,
                   b.customer_number b_customer_number,
                   b.customer_name b_customer_name,
                   b.customer_street_address b_customer_street_address,
                   b.shipping_address b_shipping_address,
                   case when a.customer_number is null then 'I'
                        when b.customer_number is null then 'D'
                        else 'U'
                    end flag
              from v_customer_dim_latest a
              full join rds.customer b on a.customer_number = b.customer_number
             where a.customer_number is null -- inserted
                or b.customer_number is null -- deleted
                or (a.customer_number = b.customer_number
                    and not (coalesce(a.customer_name,'') = coalesce(b.customer_name,'')
                             and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'')
                             and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'')
                            ))) t
     order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999;

    -- Load the product dimension
    insert into tds.product_dim
    (product_code,
     product_name,
     product_category,
     isdelete,
     version,
     effective_date)
    select case flag when 'D' then a_product_code else b_product_code end product_code,
           case flag when 'D' then a_product_name else b_product_name end product_name,
           case flag when 'D' then a_product_category else b_product_category end product_category,
           case flag when 'D' then true else false end isdelete,
           case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v,
           v_pre_date
      from (select a.product_code a_product_code,
                   a.product_name a_product_name,
                   a.product_category a_product_category,
                   a.version a_version,
                   b.product_code b_product_code,
                   b.product_name b_product_name,
                   b.product_category b_product_category,
                   case when a.product_code is null then 'I'
                        when b.product_code is null then 'D'
                        else 'U'
                    end flag
              from v_product_dim_latest a
              full join rds.product b on a.product_code = b.product_code
             where a.product_code is null -- inserted
                or b.product_code is null -- deleted
                or (a.product_code = b.product_code
                    and not (a.product_name = b.product_name
                             and a.product_category = b.product_category))) t
     order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999;

    -- Load the sales order fact table
    insert into sales_order_fact
    select a.order_number,
           customer_sk,
           product_sk,
           e.date_sk,
           e.year * 100 + e.month,
           order_amount,
           order_quantity,
           f.date_sk,
           g.sales_order_attribute_sk,
           h.customer_zip_code_sk,
           i.shipping_zip_code_sk
      from rds.sales_order a,
           v_customer_dim_his c,
           v_product_dim_his d,
           date_dim e,
           date_dim f,
           sales_order_attribute_dim g,
           v_customer_zip_code_dim h,
           v_shipping_zip_code_dim i,
           rds.customer j,
           rds.cdc_time k
     where a.customer_number = c.customer_number
       and a.order_date >= c.effective_date
       and a.order_date < c.expiry_date
       and a.product_code = d.product_code
       and a.order_date >= d.effective_date
       and a.order_date < d.expiry_date
       and date(a.order_date) = e.date
       and date(a.request_delivery_date) = f.date
       and a.verification_ind = g.verification_ind
       and a.credit_check_flag = g.credit_check_flag
       and a.new_customer_ind = g.new_customer_ind
       and a.web_order_flag = g.web_order_flag
       and a.customer_number = j.customer_number
       and j.customer_zip_code = h.customer_zip_code
       and j.shipping_zip_code = i.shipping_zip_code
       and a.entry_date >= k.last_load
       and a.entry_date < k.current_load;

    -- Reload the PA customer dimension
    truncate table pa_customer_dim;
    insert into pa_customer_dim
    select distinct a.*
      from customer_dim a,
           sales_order_fact b,
           v_customer_zip_code_dim c
     where c.customer_state = 'pa'
       and b.customer_zip_code_sk = c.customer_zip_code_sk
       and a.customer_sk = b.customer_sk;

    -- Analyze the tables in the tds schema
    analyze customer_dim;
    analyze product_dim;
    analyze sales_order_fact;
    analyze pa_customer_dim;

    -- Update the last_load column of the timestamp table
    truncate table rds.cdc_time;
    insert into rds.cdc_time select v_cur_date, v_cur_date;

end;
$$
language plpgsql;
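Two points in the function above deserve attention. First, when loading the fact table, the query joins not only the two zip code dimension views but also the rds.customer table in the staging area. Obtaining a zip code surrogate key requires joining on the zip code value, and that value has been removed from the customer dimension table; it survives only in the customer table of the source data. Second, the load of the PA sub-dimension changes. The state code has been removed from the customer dimension and moved into the new zip code dimension, and the customer dimension and the zip code dimension have no direct relationship: they are linked only through the customer surrogate key and the zip code surrogate key in the fact table. The PA sub-dimension data can therefore be extracted only by joining three tables: the fact table, the customer dimension, and the zip code dimension. This is also why the PA sub-dimension load now runs after the fact table load.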
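The second point can be illustrated with a minimal sketch, again assuming SQLite (Python's built-in sqlite3 module) as a stand-in for HAWQ. The tables are cut down to a few columns, the query reads zip_code_dim directly instead of the v_customer_zip_code_dim view, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table zip_code_dim (
    zip_code_sk integer primary key, zip_code integer, city text, state text
);
create table customer_dim (
    customer_sk integer primary key, customer_number integer, customer_name text
);
create table sales_order_fact (
    order_number integer, customer_sk integer, customer_zip_code_sk integer
);

-- Hypothetical sample data.
insert into zip_code_dim values
    (1, 17050, 'pittsburgh', 'pa'),
    (2, 44102, 'cleveland', 'oh');
insert into customer_dim values
    (1, 1, 'pa customer'),
    (2, 2, 'oh customer'),
    (3, 3, 'pa customer without orders');
insert into sales_order_fact values (101, 1, 1), (102, 1, 1), (103, 2, 2);

-- Empty copy of the customer dimension structure.
create table pa_customer_dim as select * from customer_dim where 0;
""")

# Full reload: empty the subset, then derive it through the fact table,
# the only place where customer and zip code surrogate keys meet.
conn.execute("delete from pa_customer_dim")
conn.execute("""
insert into pa_customer_dim
select distinct a.*
  from customer_dim a, sales_order_fact b, zip_code_dim c
 where c.state = 'pa'
   and b.customer_zip_code_sk = c.zip_code_sk
   and a.customer_sk = b.customer_sk
""")

pa_customers = conn.execute(
    "select customer_sk, customer_name from pa_customer_dim order by customer_sk"
).fetchall()
print(pa_customers)
```

The distinct keeps the customer with two orders from being loaded twice. The sketch also makes one consequence of this design visible: a customer with no rows in the fact table cannot reach the sub-dimension, because the fact table is the only link to the state attribute.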
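III. Testing
Follow these steps to test the modified regular load script.
- Make some changes to the customer zip code information in the source data.
- Before loading the new customer data, query the latest customer and shipping zip codes, so that the changed information can later be compared with this query's output.
- Add new sales orders to the source data.
- Run the regular load.
- Query the customer dimension table, the sales order fact table, and the PA sub-dimension table to confirm that the data has been loaded correctly.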
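Run the statements below to make two changes to the customer information in the source data: update the customer and shipping zip code information of customer number 4, and add a new customer with number 15.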
update source.customer
   set customer_street_address = '9999 louise dr.',
       customer_zip_code = 17055,
       customer_city = 'pittsburgh',
       shipping_address = '9999 louise dr.',
       shipping_zip_code = 17055,
       shipping_city = 'pittsburgh'
 where customer_number = 4;

insert into source.customer
values (15, 'super stores', '1000 woodland st.', 17055, 'pittsburgh', 'pa',
        '1000 woodland st.', 17055, 'pittsburgh', 'pa');

commit;
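Now, before loading the new customer data, query the latest customer and shipping zip codes so that the changed information can be compared with this output afterwards. The query is as follows.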
select order_date_sk odsk,
       customer_number cn,
       customer_zip_code czc,
       shipping_zip_code szc
  from v_customer_zip_code_dim a,
       v_shipping_zip_code_dim b,
       sales_order_fact c,
       customer_dim d
 where a.customer_zip_code_sk = c.customer_zip_code_sk
   and b.shipping_zip_code_sk = c.shipping_zip_code_sk
   and d.customer_sk = c.customer_sk
 order by odsk;
Then use the statements below to add two sales orders.
-- First order: customer 4, placed at a random time on the morning of 2017-05-30
set @order_date := from_unixtime(unix_timestamp('2017-05-30 00:00:01') + rand() * (unix_timestamp('2017-05-30 12:00:00') - unix_timestamp('2017-05-30 00:00:01')));
set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);
set @amount := floor(1000 + rand() * 9000);
set @quantity := floor(10 + rand() * 90);

insert into source.sales_order
values (null, 4, 3, 'y', 'y', 'y', 'n', @order_date, @request_delivery_date, @order_date, @amount, @quantity);

-- Second order: customer 15, placed at a random time in the second half of 2017-05-30
set @order_date := from_unixtime(unix_timestamp('2017-05-30 12:00:00') + rand() * (unix_timestamp('2017-05-31 00:00:00') - unix_timestamp('2017-05-30 12:00:00')));
set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400);
set @amount := floor(1000 + rand() * 9000);
set @quantity := floor(10 + rand() * 90);

insert into source.sales_order
values (null, 15, 4, 'y', 'n', 'y', 'n', @order_date, @request_delivery_date, @order_date, @amount, @quantity);

commit;
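The random values in the statements above all follow one pattern: start + rand() * (end - start) picks a point inside a time window, and floor(low + rand() * width) picks an integer in a range. A minimal Python sketch of the same arithmetic is shown here for readers who want to generate similar test data outside MySQL; the helper name random_datetime and the fixed seed are assumptions made for this illustration.

```python
import random
from datetime import datetime, timedelta


def random_datetime(start: datetime, end: datetime, rng: random.Random) -> datetime:
    """Pick a timestamp between start and end, as start + rand() * (end - start)."""
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.random() * span_seconds)


rng = random.Random(0)  # fixed seed so repeated runs give the same test data

# Same windows as the two orders above: first and second half of 2017-05-30.
first_order_date = random_datetime(
    datetime(2017, 5, 30, 0, 0, 1), datetime(2017, 5, 30, 12, 0, 0), rng)
second_order_date = random_datetime(
    datetime(2017, 5, 30, 12, 0, 0), datetime(2017, 5, 31, 0, 0, 0), rng)

# floor(1000 + rand() * 9000) and floor(10 + rand() * 90)
amount = int(1000 + rng.random() * 9000)
quantity = int(10 + rng.random() * 90)

print(first_order_date, second_order_date, amount, quantity)
```

Splitting the day into two windows guarantees that the first order is entered before the second, which keeps the test data in chronological order.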
Run the following command to perform the regular load.
~/regular_etl.sh
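Query the customer dimension through the v_customer_dim_his view to confirm that the two changed customers, numbers 4 and 15, have been loaded correctly.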
select customer_sk csk,
       customer_number cnum,
       customer_name cnam,
       customer_street_address csd,
       shipping_address sd,
       version,
       effective_date,
       expiry_date
  from v_customer_dim_his
 where customer_number in (4, 15);
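Figure 2 shows the query result.
Query the two new sales orders in the sales_order_fact table to confirm that the zip codes have been loaded correctly.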
select a.order_number onum,
       e.customer_number cnum,
       b.customer_zip_code czc,
       c.shipping_zip_code szc,
       f.product_code pc,
       d.order_date od,
       a.order_amount,
       a.order_quantity
  from sales_order_fact a,
       v_customer_zip_code_dim b,
       v_shipping_zip_code_dim c,
       v_order_date_dim d,
       customer_dim e,
       product_dim f
 where a.customer_sk = e.customer_sk
   and a.product_sk = f.product_sk
   and a.customer_zip_code_sk = b.customer_zip_code_sk
   and a.shipping_zip_code_sk = c.shipping_zip_code_sk
   and a.order_date_sk = d.order_date_sk
 order by a.order_number desc
 limit 2;
Figure 3 shows the query result.
Query the v_pa_customer_dim_his view to confirm that the PA customers have been loaded correctly.
select customer_sk csk,
       customer_number cnum,
       customer_name cnam,
       customer_street_address csa,
       shipping_address sad,
       version,
       effective_date,
       expiry_date
  from v_pa_customer_dim_his
 order by customer_sk;
Figure 4 shows the query result.