Chapter 2. Getting started with DataCleaner desktop
Table of Contents
|--Installing the desktop application
|--Connecting to your datastore
|--Adding components to the job
|--Wiring components together
-->Transformer output
-->Filter requirement
-->Output data streams
|--Executing jobs
|--Saving and opening jobs
|--Template jobs
|--Writing cleansed data to files
Installing the desktop application
These are the system requirements of DataCleaner:
- A computer (with a graphical display, except if run in command-line mode).
- A Java Runtime Environment (JRE), version 7 or higher.
- A DataCleaner software license file for professional editions. If you've requested a free trial or purchased DataCleaner online, this file will have been sent to your email address.
Start the installation procedure using the installer program. The installer is an executable JAR file, which on most systems you can run by simply double-clicking it.
If the installer does not launch when you double-click it, open a command prompt and enter:
java -jar DataCleaner-[edition]-[version]-install.jar
Troubleshooting
Usually the installation procedure is trivial and self-explanatory. But in case something is not working as expected, please check the following points:
On Windows systems, if you do not have Administrative privileges on the machine, we encourage you to install DataCleaner in your user's directory instead of in 'Program Files'.
On some Windows systems you may get the warning "There is no script engine for file extension '.js'". This happens when .js (JavaScript) files are associated with an editor instead of Windows' built-in scripting engine. To resolve this issue, please refer to these help links: , which addresses the issue and recommends... , which has a fix for the issue
If you have issues with locating or selecting the software license file, you can skip that step in the installer by copying the license file manually to this folder: '~/.datacleaner' (where ~ is your user's home folder). Note that on Windows machines, Windows Explorer does not allow you to create directories whose names start with a dot (.), but it can be done using the command prompt:
mkdir .datacleaner
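The manual copy described above can also be scripted. The following is only a minimal sketch, assuming Python is available on the machine. The helper name `install_license` and the file name `datacleaner.lic` are hypothetical placeholders (the actual license file name is not given here), and nothing in this snippet is part of DataCleaner itself.

```python
import shutil
from pathlib import Path


def install_license(license_file, home=None):
    """Copy a license file into <home>/.datacleaner, creating the folder if needed."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".datacleaner"
    # Creating a dot-prefixed folder programmatically also works on Windows.
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of the copied file inside target_dir.
    return Path(shutil.copy(license_file, target_dir))


if __name__ == "__main__":
    # "datacleaner.lic" is a placeholder name; use the file you received by email.
    source = Path("datacleaner.lic")
    if source.exists():
        print("Copied to", install_license(source))
    else:
        print("License file not found:", source)
```

Passing an explicit `home` argument makes it possible to try the helper against a scratch directory before touching the real home folder.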
Connecting to your datastore
Below is a screenshot of the initial screen that is presented when launching DataCleaner (desktop community edition). A new datastore can be added from the "New job from scratch" or "Manage datastores" screens, which are opened with the buttons at the bottom of the screen.
File datastores can be added using a drop zone (or browse button) located at the top of the screen. Below it are buttons for adding databases or cloud services.
If the file is added using the drop zone, its format will be inferred. If you need more control over how the file is interpreted, add the datastore through the "Manage datastores" button on the welcome screen instead.
The "Datastore management" screen, apart from letting you view and edit existing datastores, has an option to add a new one based on its type. Choose the icon at the bottom of the screen that suits your datastore type.
Once you've registered ('created') your own datastore, you can select it from the list (in the "New job from scratch" screen), or select it from the list and click "Build job" (in the "Datastore management" screen) to start working with it!
You can also configure your datastore by means of the configuration file (conf.xml), which has both pros and cons. For more information, read the configuration file chapter.
Adding components to the job
There are a few different kinds of components that you can add to your job:
Analyzers, which are the most important components. Without at least one analyzer the job will not run (if you execute the job without adding one, DataCleaner will suggest adding a basic one that will save the output to a file). An analyzer is a component that inspects the data that it receives and generates a result or a report. Most of the data profiling functionality is implemented as analyzers.
Transformers are components used to modify the data before analyzing it. Sometimes it's necessary to extract parts of a value or combine two values to correctly get an idea about a particular measure. In other scenarios, transformers can be used to perform reference data lookups or other similar tasks and place the results of an operation into the stream of data in the job.
The result of a transformer is a set of output columns. These columns work exactly like regular columns in your job, except that they have a preceding step in the flow before they become materialized.
Filters are components that split the flow of processing in a job. A filter has a number of possible outcomes, and depending on the outcome of a filter, a particular row might be processed by different sub-flows. Filters are often used simply to disregard certain rows from the analysis, e.g. null values or values outside the range of interest.
Each of these components is presented as a node in the job graph. Double-clicking a component (graph node) brings up its configuration dialog.
Transformers and filters are added to your job using the "Transform" and "Improve" menus. The menus are available in the component library on the left or by right-clicking on an empty space in the canvas. Please refer to the reference chapter Transformations for more information on specific transformers and filters.
Analyzers are added to your job using the "Analyze" menu (in most cases), or the "Write" menu for analyzers that save output to a datastore. Please refer to the reference chapter Analyzers for more information on specific analyzers.
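To make the three component kinds more concrete, here is a small conceptual sketch in Python. It is not DataCleaner code: the function names, the column names and the outcome labels `NOT_NULL` / `NULL` are illustrative assumptions, chosen only to mirror the roles described above (a transformer adds output columns, a filter assigns an outcome to each row, an analyzer produces a result).

```python
def transformer(row):
    """Transformer: adds an output column derived from existing columns."""
    out = dict(row)
    out["full_name"] = f"{row['first_name']} {row['last_name']}".strip()
    return out


def null_filter(row):
    """Filter: categorizes a row into one of its possible outcomes."""
    return "NOT_NULL" if row.get("email") else "NULL"


def analyzer(rows):
    """Analyzer: inspects the rows it receives and produces a result."""
    return {"row_count": len(rows)}


def run_job(source_rows):
    """Run the rows through transformer, filter and analyzer in that order."""
    transformed = [transformer(r) for r in source_rows]
    # Only rows with the NOT_NULL outcome reach the analyzer.
    accepted = [r for r in transformed if null_filter(r) == "NOT_NULL"]
    return analyzer(accepted)


rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.org"},
    {"first_name": "Alan", "last_name": "Turing", "email": None},
]
print(run_job(rows))  # {'row_count': 1}
```

In the real application the same structure is built visually in the job graph rather than in code.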
Wiring components together
Simply adding a transformer or filter actually doesn't change your job as such! This is because these components only have an impact if you wire them together.
Transformer output
To wire a transformer you simply need to draw an arrow between the components in the graph. You can start drawing it by right-clicking the first of the components and choosing "Link to..." from the context menu. An alternative way to enter the drawing mode is to select the component and connect the components while holding down the Shift key.
Filter requirement
To wire a filter you need to set up a dependency on one of its outcomes. All components have a button for selecting filter outcomes in the top-right corner of their configuration dialogs. Click this button to select a filter outcome to depend on.
If you have multiple filters you can chain these simply by having dependent outcomes of the individual filters. This will require all filter requirements in the chain to be met, for a record to be passed to the component (AND logic).
Chained filters
Using "Link to...", it is also possible to wire several filters to a component in a kind of diamond shape. In that case, if any of the filter requirements are met, the record will be passed to the component (OR logic).
"Diamond" filters
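The difference between chained and "diamond" wiring comes down to AND versus OR over the filter requirements. The sketch below illustrates only that logic; the two example requirements (`has_email`, `has_phone`) and the helper names are hypothetical and do not correspond to DataCleaner classes.

```python
def has_email(row):
    """Example requirement: the row has a non-empty email value."""
    return bool(row.get("email"))


def has_phone(row):
    """Example requirement: the row has a non-empty phone value."""
    return bool(row.get("phone"))


def passes_chained(row, requirements):
    """Chained filters: every requirement must be met (AND logic)."""
    return all(req(row) for req in requirements)


def passes_diamond(row, requirements):
    """'Diamond' wiring: one met requirement is enough (OR logic)."""
    return any(req(row) for req in requirements)


requirements = [has_email, has_phone]
rows = [
    {"email": "a@example.org", "phone": "555-0100"},
    {"email": None, "phone": "555-0101"},
    {"email": None, "phone": None},
]
for row in rows:
    print(row, passes_chained(row, requirements), passes_diamond(row, requirements))
```

A row with only a phone number is rejected by the chained setup but accepted by the diamond setup, which is the practical difference between the two wirings.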
Output data streams
The "Link to..." option wires components together in the "main flow". However, some components are able to produce additional output data streams. For example, the main feature of a Completeness Analyzer is to produce a summary of records completeness in the job result window. Additionally, it produces two output data streams - "Complete records" and "Incomplete records". Output data streams behave similarly to a source table, although such a table is created dynamically by a component. This enables further processing of such output.
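As a rough mental model, a component with output data streams returns its normal result plus one or more named, dynamically created tables. The sketch below imitates the Completeness Analyzer example in plain Python. The function name, the result layout and the rule that `None` and empty strings count as missing are assumptions made for illustration, not the actual implementation.

```python
def completeness(rows, columns):
    """Return a summary plus two named output streams of records."""
    complete, incomplete = [], []
    for row in rows:
        # Assumption: a value is "missing" if it is None or an empty string.
        is_complete = all(row.get(c) not in (None, "") for c in columns)
        (complete if is_complete else incomplete).append(row)
    summary = {
        "total": len(rows),
        "complete": len(complete),
        "incomplete": len(incomplete),
    }
    streams = {"Complete records": complete, "Incomplete records": incomplete}
    return summary, streams


records = [
    {"name": "Ada", "city": "London"},
    {"name": "Alan", "city": ""},
    {"name": None, "city": "Paris"},
]
summary, streams = completeness(records, ["name", "city"])
print(summary)
# Each stream can be processed further, just like a source table.
for incomplete_row in streams["Incomplete records"]:
    print("needs attention:", incomplete_row)
```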
Components producing output data streams have additional "Link to..." entries in the right-click menu for wiring their output to subsequent components.
Instead of wiring components with the "Link to..." menu option, you can double-click a component to bring up its configuration dialog and choose its input columns there. In the top-right corner of the dialog, the scope of the component can be chosen. Switching between scopes lets you choose input columns from the "main flow" (default scope) or from output data streams.
An example job using output data streams:
The canvas displays messages (at the bottom of the screen) with instructions on the next steps that need to be performed in order to build a valid job.
Executing jobs
When a job has been built you can execute it. To check whether your job is correctly configured and ready to execute, check the status bar at the bottom of the job building window.
To execute the job, simply click the "Execute" button in the top-right corner of the window. This will bring up the result window, which contains:
The Progress information tab, which contains useful information and progress indications while the job is being executed.
Additional tabs for each component type that produces a result/report. For instance, a 'Value distribution' tab appears if such a component was added to the job.
Here's an example of an analysis result window:
Saving and opening jobs
You can save your jobs in order to reuse them at a later time. Saving a job is simple: just click the "Save" button in the top panel of the window.
Analysis jobs are saved in files with the ".analysis.xml" extension. These files are XML files that are readable and editable using any XML editor.
Opening jobs can be done using the "Open" menu item. Opening a job will restore a job building window from where you can edit and run the job.
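Because job files are plain XML, they can also be inspected outside DataCleaner with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree` module to count the elements in a document. The embedded `SAMPLE` string is a made-up stand-in and should not be read as the real ".analysis.xml" schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up XML used only for demonstration; a real job file will look different.
SAMPLE = """
<job>
  <source>
    <columns>
      <column id="c1"/>
      <column id="c2"/>
    </columns>
  </source>
  <analysis>
    <analyzer/>
  </analysis>
</job>
"""


def count_elements(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Strip a namespace prefix of the form "{uri}name" if one is present.
        tag = element.tag.rsplit("}", 1)[-1]
        counts[tag] += 1
    return counts


for name, count in sorted(count_elements(SAMPLE).items()):
    print(name, count)
```

To try it on a saved job, read the file's text (for example with `pathlib.Path(...).read_text()`) and pass it to `count_elements`.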
Template jobs
DataCleaner contains a feature where you can reuse a job for multiple datastores or just multiple columns in the same datastore. We call this feature 'template jobs'.
When opening a job you are presented with a file chooser. When you select a job file a panel will appear, containing some information about the job as well as available actions:
If you click the 'Open as template' button you will be presented with a dialog where you can map the job's original columns to a new set of columns:
First you need to specify the datastore to use. On the left side you see the name of the original datastore, but the job is not restricted to use only this datastore. Select a datastore from the list and the fields below for the columns will become active.
Then you need to map individual columns. If you have two datastores that have the same column names, you can click the "Map automatically" button and they will be automatically assigned. Otherwise you need to map the columns from the new datastore's available columns.
Finally, your job may contain 'Job-level variables'. These are configurable properties of the job that you might also want to fill out.
Once these 2-3 steps have been completed, click the "Open job" button, and DataCleaner will be ready for executing the job on a new set of columns!
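The column mapping step can be pictured as a lookup from each original column to a column in the new datastore. The sketch below shows one plausible reading of "Map automatically", assuming columns are matched by exact, case-sensitive name; the function name and that matching rule are assumptions, not documented behavior.

```python
def map_automatically(original_columns, new_columns):
    """Map each original column to a same-named new column, or None if absent."""
    available = set(new_columns)
    return {col: (col if col in available else None) for col in original_columns}


original = ["id", "name", "email"]
new = ["id", "email", "phone"]
mapping = map_automatically(original, new)
print(mapping)  # {'id': 'id', 'name': None, 'email': 'email'}

# Columns left unmapped would have to be assigned by hand.
unmapped = [col for col, target in mapping.items() if target is None]
print(unmapped)  # ['name']
```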
Writing cleansed data to files
Although the primary focus of DataCleaner is analysis, during such analysis you will often find yourself actually improving the data by applying transformers and filters to it. When this is the case, you will want to export the improved/cleansed data so that you can use it outside of the analysis.
Please refer to the reference chapter Writers for more information on writing cleansed data.
