DataCleaner, Chapter 1

Part 1. Introduction to DataCleaner

|--What is data quality (DQ)?
|--What is data profiling?
|--What is a datastore?
　　Composite datastore
|--What is data monitoring?
|--What is master data management (MDM)?

What is data quality (DQ)?

Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. The term is often applied to the quality of data used in business decisions, but it may also refer to the quality of data used in research, campaigns, processes and more.

Work on data quality varies a lot from project to project, just as the data quality issues themselves vary a lot. Examples of data quality issues include:

1. Completeness of data
2. Correctness of data
3. Duplication of data
4. Uniformity/standardization of data
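
The four issue types listed above can each be turned into a simple measurement. The following is a minimal Python sketch, not part of DataCleaner; the sample records, the field names and the email pattern are assumptions made purely for illustration.

```python
import re

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "country": "DK"},
    {"name": "Bob Roy", "email": None, "country": "Denmark"},
    {"name": "Ann Lee", "email": "ann@example.com", "country": "DK"},
    {"name": "Cy Dunn", "email": "not-an-email", "country": "dk"},
]

# Deliberately loose pattern, assumed here as the rule for a "correct" email.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def correctness(rows, field, pattern):
    """Fraction of non-empty values that match the expected pattern."""
    values = [row[field] for row in rows if row.get(field)]
    if not values:
        return None
    return sum(1 for value in values if pattern.match(value)) / len(values)


def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


def distinct_spellings(rows, field):
    """Distinct raw values of a field; many variants hint at poor standardization."""
    return sorted({row[field] for row in rows if row.get(field)})


print(completeness(records, "email"))           # 0.75
print(correctness(records, "email", EMAIL_PATTERN))
print(duplicate_count(records))                 # 1
print(distinct_spellings(records, "country"))   # ['DK', 'Denmark', 'dk']
```

Which of these numbers matters most depends on the intended use of the data, which is why the goal of the analysis has to be settled first.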

A less technical definition of high-quality data is that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).

Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. A DQA includes both technical and non-technical elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers. This is needed to assess what the goal of the DQA should be.

From a technical viewpoint, the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.

What is data profiling?

Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be much better equipped to use and improve it.
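
To make the idea of a 'profile' concrete, here is a minimal Python sketch that summarizes each column of a small table. It is an illustration only and does not reflect how DataCleaner computes its profiles; the function names, the chosen statistics and the sample rows are all assumptions.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: counts, distinct values and the most common ones."""
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in non_null]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "most_common": Counter(non_null).most_common(3),
    }


def profile_table(rows):
    """Profile every column found in a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical rows standing in for a real datastore.
rows = [
    {"city": "Oslo", "zip": "0150"},
    {"city": "oslo", "zip": ""},
    {"city": "Bergen", "zip": "5003"},
]

for column, stats in profile_table(rows).items():
    print(column, stats)
```

Profiling every column, including the ones assumed to be fine, is cheap in a sketch like this and matches the explorative approach recommended in this section.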

The way you do profiling often depends on whether you already have some ideas about the quality of the data or are not yet familiar with the datastore at hand. Either way we recommend an explorative approach. Even if you think there are only a certain number of issues you need to look for, it is our experience (and the reasoning behind many of DataCleaner's features) that it is just as important to check those items in the data that you think are correct!

Typically it's cheap to include a bit more data in your analysis, and the results just might surprise you and save you time!

DataCleaner comprises (among other things) a desktop application for doing data profiling on just about any kind of datastore.

What is a datastore?

A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.

To cover different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore.

DataCleaner is capable of retrieving data from a very wide range of datastores. Furthermore, DataCleaner can update the data in most of these datastores as well.

A datastore can be created in the UI or via the configuration file. You can create a datastore from any type of source, such as CSV, Excel, Oracle Database or MySQL.

Click to register a new datastore.

Composite datastore

A composite datastore contains multiple datastores. The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.
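
The benefit of working across several sources in one job can be sketched in plain Python. This is not DataCleaner's composite datastore implementation; it only illustrates the idea with an inlined CSV source and an in-memory SQLite database, and every table, column and function name is an assumption.

```python
import csv
import io
import sqlite3

# Source 1: a CSV "datastore", inlined so the sketch is self-contained.
CSV_TEXT = (
    "customer_id,email\n"
    "1,ann@example.com\n"
    "2,bob@example.com\n"
    "3,cy@example.com\n"
)


def read_csv_customers(text):
    """Read customer rows from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def open_orders_db():
    """Source 2: a relational "datastore" held in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (3, 12.5), (4, 9.0)],
    )
    return conn


def order_customer_ids(conn):
    """Distinct customer ids that appear in the orders table."""
    return {row[0] for row in conn.execute("SELECT DISTINCT customer_id FROM orders")}


def customers_without_orders(customers, conn):
    """Customer ids from the CSV source that never appear in the database."""
    with_orders = order_customer_ids(conn)
    return [c["customer_id"] for c in customers if int(c["customer_id"]) not in with_orders]


def orders_without_customer(customers, conn):
    """Customer ids in the database that are unknown to the CSV source."""
    known = {int(c["customer_id"]) for c in customers}
    return sorted(cid for cid in order_customer_ids(conn) if cid not in known)


customers = read_csv_customers(CSV_TEXT)
conn = open_orders_db()
print(customers_without_orders(customers, conn))  # ['2']
print(orders_without_customer(customers, conn))   # [4]
```

Neither check is possible when each source is analyzed in isolation, which is the kind of cross-source question a composite datastore is meant to support.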

What is data monitoring?

We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you make when profiling often need to be checked continuously so that your improvements are enforced over time. This is what data monitoring is typically about.

Data monitoring solutions come in different shapes and sizes. You can set up your own batch of scheduled jobs that run every night. You can build alerts around them that send you emails if a particular measure goes beyond its allowed thresholds. In some cases you can attempt to rule out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry time, e.g. in data registration forms.
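
The threshold-based alerting described above can be sketched in a few lines of Python. This is an illustrative assumption, not the DataCleaner monitor's API: the `Threshold` class, the metric names and the limits are all hypothetical, and a real setup would send the alerts by email instead of printing them.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Allowed range for one metric (hypothetical structure)."""
    metric: str
    minimum: float = float("-inf")
    maximum: float = float("inf")


def check_metrics(metrics, thresholds):
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            alerts.append(f"{threshold.metric}: no measurement available")
        elif not (threshold.minimum <= value <= threshold.maximum):
            alerts.append(
                f"{threshold.metric}: {value} outside "
                f"[{threshold.minimum}, {threshold.maximum}]"
            )
    return alerts


# Hypothetical results of a nightly profiling job.
nightly_metrics = {"email_completeness": 0.91, "duplicate_rows": 12}

thresholds = [
    Threshold("email_completeness", minimum=0.95),
    Threshold("duplicate_rows", maximum=50),
]

alerts = check_metrics(nightly_metrics, thresholds)
for alert in alerts:
    print(alert)
```

Running the same check after every scheduled job is what turns a one-off profiling result into ongoing monitoring.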

As of version 3, DataCleaner also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.

What is master data management (MDM)?

Master data management (MDM) is a very broad term and materializes in a variety of ways. For the scope of this document it serves more as a context for data quality than as an activity that we actually target with DataCleaner per se.

The overall goal of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", i.e. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.

Another very important issue to handle in MDM is the quality of data. If you simply gather, e.g., "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be many duplicate entries, there will be variations in the way that customer data is filled in, and there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.
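
As a rough illustration of cleansing and unifying customer data from several systems, the Python sketch below normalizes records and merges them into one entry per customer. It is not how DataCleaner does this; the source names, the record fields and the choice of the email address as the matching key are all assumptions.

```python
def normalize(record):
    """Map a source record onto a shared model (field choices are assumptions)."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def unify(*sources):
    """Merge records from several sources, keeping one entry per email address."""
    master = {}
    for source_name, records in sources:
        for record in records:
            clean = normalize(record)
            entry = master.setdefault(clean["email"], {**clean, "sources": []})
            entry["sources"].append(source_name)
    return list(master.values())


# Hypothetical extracts from two systems that describe overlapping customers.
crm = [{"name": "ann  lee", "email": "Ann@Example.com "}]
billing = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "bob roy", "email": "bob@example.com"},
]

master_customers = unify(("crm", crm), ("billing", billing))
for customer in master_customers:
    print(customer)
```

Matching on a single normalized key is the simplest possible rule; real customer data usually needs fuzzier matching, because identifiers and levels of granularity differ between systems.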