DataCleaner, Chapter 1
Part 1. Introduction to DataCleaner
- What is data quality (DQ)?
- What is data profiling?
- What is a datastore?
  - Composite datastore
- What is data monitoring?
- What is master data management (MDM)?
What is data quality (DQ)?
Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. The term is most often applied to the quality of data used in business decisions, but it may also refer to the quality of data used in research, campaigns, processes and more.
Working with data quality varies a lot from project to project, just as the data quality issues themselves vary a lot. Examples of data quality issues include:
- Completeness of data
- Correctness of data
- Duplication of data
- Uniformity/standardization of data
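As a rough illustration of how the issues listed above can be turned into measurements, here is a minimal Python sketch. It is not DataCleaner code: the sample records, the field names and the helper functions (`completeness`, `invalid_emails`, `duplicate_count`, `distinct_spellings`) are all assumptions made for this example, and the email pattern is deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical customer records; an empty string marks a missing value.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "country": "DK"},
    {"name": "Bob Ray", "email": "", "country": "Denmark"},
    {"name": "Ann Lee", "email": "ann@example.com", "country": "DK"},
    {"name": "Cy Dunn", "email": "cy-at-example.com", "country": "dk"},
]


def completeness(rows, field):
    """Completeness: fraction of rows where the field has a non-empty value."""
    filled = sum(1 for row in rows if row.get(field))
    return filled / len(rows)


def invalid_emails(rows):
    """Correctness: non-empty emails that fail a deliberately simple pattern."""
    pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    return [row["email"] for row in rows
            if row["email"] and not pattern.match(row["email"])]


def duplicate_count(rows):
    """Duplication: number of rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def distinct_spellings(rows, field):
    """Standardization: the distinct raw values that occur in a field."""
    return sorted({row[field] for row in rows if row[field]})


print(completeness(records, "email"))
print(invalid_emails(records))
print(duplicate_count(records))
print(distinct_spellings(records, "country"))
```

Each function maps to one bullet in the list. Real projects usually need fuzzier rules, for instance matching near-duplicates rather than exact copies.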
A less technical definition of high-quality data is that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).
Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. A DQA includes both technical and non-technical elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers.
This is needed to assess what the goal of the DQA should be.
From a technical viewpoint, the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.
What is data profiling?
Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be a lot better equipped to actually use and improve it.
The way you do profiling often depends on whether you already have some ideas about the quality of the data or are not yet experienced with the datastore at hand. Either way we recommend an explorative approach: even if you think there is only a certain number of issues you need to look for, it is our experience (and the reasoning behind a lot of the features of DataCleaner) that it is just as important to check those items in the data that you think are correct!
Typically it's cheap to include a bit more data in your analysis, and the results just might surprise you and save you time!
DataCleaner comprises (amongst other aspects) a desktop application for doing data profiling on just about any kind of datastore.
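To make the idea of a 'profile' more concrete, the sketch below computes a few basic statistics per column over in-memory rows. It is only an illustration under stated assumptions: `profile_column`, `profile_table` and the sample `cities` data are hypothetical names invented here, and DataCleaner's own analyzers are considerably richer than this.

```python
from collections import Counter


def profile_column(values):
    """Return simple profile statistics for one column of values."""
    # Treat None and the empty string as missing.
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in non_null]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "most_common": Counter(non_null).most_common(1),
    }


def profile_table(rows):
    """Profile every column of a list of dict rows (columns taken from the first row)."""
    columns = list(rows[0].keys()) if rows else []
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical column: note the inconsistent casing and the missing value.
cities = ["Oslo", "oslo", None, "Copenhagen", "Oslo"]
city_profile = profile_column(cities)
print(city_profile)
```

Profiling every column, including the ones you believe are fine, follows the explorative approach recommended above: here the distinct count exposes the "Oslo"/"oslo" casing variation even if nobody was looking for it.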
What is a datastore?
A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.
To cover different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore.
DataCleaner is capable of retrieving data from a very wide range of datastores. Furthermore, DataCleaner can update the data of most of these datastores as well.
A datastore can be created in the UI or via the configuration file. You can create a datastore from many types of source, such as CSV, Excel, Oracle Database, MySQL, etc.
Composite datastore
A composite datastore contains multiple datastores. The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.
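The benefit of working across sources in one job can be sketched in plain Python. The example below reads customers from CSV text and orders from an in-memory SQLite database, then checks one against the other. It does not use or imitate DataCleaner's composite datastore implementation; the data, the table layout and the function names are assumptions made purely for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical first source: customers exported as CSV.
CSV_TEXT = "customer_id,name\n1,Ann Lee\n2,Bob Ray\n"


def read_csv_customers(text):
    """Load customers from CSV text into a dict keyed by integer id."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["customer_id"]): row["name"] for row in reader}


def open_orders_db():
    """Hypothetical second source: an orders table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.5), (3, 7.0)],
    )
    return conn


def orders_without_customer(customers, conn):
    """Cross-source check: orders whose customer_id is absent from the CSV source."""
    rows = conn.execute(
        "SELECT customer_id, amount FROM orders ORDER BY customer_id"
    )
    return [(cid, amount) for cid, amount in rows if cid not in customers]


customers = read_csv_customers(CSV_TEXT)
conn = open_orders_db()
orphans = orders_without_customer(customers, conn)
print(orphans)
conn.close()
```

A check like this is only possible when both sources are visible to the same job, which is the point of combining datastores.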
What is data monitoring?
We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you make when profiling often need to be checked continuously so that your improvements are enforced over time. This is what data monitoring is typically about.
Data monitoring solutions come in different shapes and sizes. You can set up your own batch of scheduled jobs that run every night. You can build alerts around them that send you emails if a particular measure goes beyond its allowed thresholds. In some cases you can attempt to rule out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry time, e.g. in data registration forms.
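The threshold-based alerting described above might look roughly like the following sketch. The metric names, the threshold values and the `check_thresholds` function are hypothetical. A real setup would compute the metrics in a scheduled job and send the alerts by email rather than printing them.

```python
def check_thresholds(metrics, thresholds):
    """Compare measured values with allowed (low, high) ranges and collect alerts."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was not measured at all is worth flagging too.
            alerts.append(f"{name}: no measurement")
        elif not (low <= value <= high):
            alerts.append(f"{name}: {value} outside [{low}, {high}]")
    return alerts


# Hypothetical results of last night's profiling job.
metrics = {"email_completeness": 0.72, "duplicate_rows": 3, "row_count": 1200}

# Hypothetical allowed ranges agreed with the business.
thresholds = {
    "email_completeness": (0.95, 1.0),
    "duplicate_rows": (0, 0),
    "row_count": (1000, 5000),
    "invalid_emails": (0, 10),
}

alerts = check_thresholds(metrics, thresholds)
for alert in alerts:
    print(alert)
```

Keeping the thresholds in data rather than in code makes it easier to adjust them as the quality of the data improves.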
As of version 3, DataCleaner also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling jobs, and exposes metrics through web services as well as through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.
What is master data management (MDM)?
Master data management (MDM) is a very broad term and materializes in a variety of ways. For the scope of this document it serves more as a context for data quality than as an activity that we actually target with DataCleaner per se.
The overall goal of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", i.e. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.
Another very important issue to handle in MDM is the quality of data. If you simply gather, e.g., "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be a lot of duplicate entries, there will be variance in the way that customer data is filled in, and there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.
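A toy version of "unifying data into a single model" is sketched below: customer rows from two hypothetical systems are merged into one master record per normalized email address, while the original identifiers are retained. The system names, the fields and the matching rule are assumptions chosen for brevity. Real MDM matching is usually fuzzier than an exact key, and this is not how DataCleaner itself performs the task.

```python
def normalize_email(email):
    """Cleansing step: trim whitespace and lower-case the address."""
    return email.strip().lower()


def unify_customers(*sources):
    """Merge (system_name, rows) pairs into one master record per email.

    The first name seen for an email is kept, and every source identifier
    is recorded so the master record can be traced back to its systems.
    """
    master = {}
    for system, rows in sources:
        for row in rows:
            key = normalize_email(row["email"])
            entry = master.setdefault(
                key, {"email": key, "name": row["name"], "source_ids": []}
            )
            entry["source_ids"].append((system, row["id"]))
    return list(master.values())


# Two hypothetical systems with different identifiers for the same person.
crm = ("crm", [{"id": "C-1", "name": "Ann Lee", "email": "Ann@Example.com"}])
shop = ("shop", [
    {"id": 17, "name": "A. Lee", "email": "ann@example.com "},
    {"id": 18, "name": "Bob Ray", "email": "bob@example.com"},
])

master_records = unify_customers(crm, shop)
for record in master_records:
    print(record)
```

The two "Ann Lee" rows collapse into a single master record even though their identifiers, name spelling and email formatting differ, which mirrors the duplicate and identifier problems described above.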