DataCleaner (4.5), Chapter 2
Chapter 2. Getting started with DataCleaner desktop
Table of Contents
|--Installing the desktop application
|--Connecting to your datastore
|--Adding components to the job
|--Wiring components together
-->Transformer output
-->Filter requirement
-->Output data streams
|--Executing jobs
|--Saving and opening jobs
|--Template jobs
|--Writing cleansed data to files
Installing the desktop application
These are the system requirements of DataCleaner:
- A computer (with a graphical display, except if run in command-line mode).
- A Java Runtime Environment (JRE), version 7 or higher.
- A DataCleaner software license file for professional editions. If you've requested a free trial or purchased DataCleaner online, this file will have been sent to your email address.
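If you are unsure whether an installed Java runtime satisfies the "version 7 or higher" requirement, the check can be sketched roughly as below. This is only an illustration: the function names are made up for this example, and the version strings are assumed to look like the ones Java typically reports (for example '1.7.0_80' or '11.0.2').

```python
import re


def java_major_version(version: str) -> int:
    """Return the major Java version for strings such as '1.7.0_80' or '11.0.2'."""
    parts = re.split(r"[._\-+]", version.strip())
    major = int(parts[0])
    # Before Java 9, versions were written as 1.x (for example 1.7 for Java 7).
    if major == 1 and len(parts) > 1:
        major = int(parts[1])
    return major


def meets_requirement(version: str, minimum: int = 7) -> bool:
    """True if the given version string is at least the required major version."""
    return java_major_version(version) >= minimum


if __name__ == "__main__":
    for v in ("1.6.0_45", "1.7.0_80", "11.0.2"):
        print(v, meets_requirement(v))
```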
Start the installation procedure using the installer program. The installer is an executable JAR file; on most systems you can run it by simply double-clicking it.
Tip
If the installer does not launch when you double-click it, open a command prompt and enter:
java -jar DataCleaner-[edition]-[version]-install.jar
Troubleshooting
Usually the installation procedure is straightforward and self-explanatory. But in case something is not working as expected, please check the following points:
- On Windows systems, if you do not have administrative privileges on the machine, we encourage you to install DataCleaner in your user directory instead of in 'Program Files'.
- On some Windows systems you may get the warning "There is no script engine for file extension '.js'". This happens when .js (JavaScript) files are associated with an editor instead of Windows' built-in scripting engine. To resolve this issue, please refer to these help links:
  - answers.microsoft.com, which addresses the issue and recommends...
  - winhelponline.com, which has a fix for the issue
- If you have issues with locating or selecting the software license file, you can skip that step in the installer by copying the license file manually to this folder: '~/.datacleaner' (where ~ is your user's home folder). Note that Windows Explorer does not allow creating directories whose names start with a dot (.), but it can be done using the command prompt:
mkdir .datacleaner
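The manual copy described above can also be scripted. The following is a minimal sketch, not part of DataCleaner itself: the function name and the license file name are hypothetical, and the only detail taken from the text is the target folder '~/.datacleaner'.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def install_license(license_file: Path, home: Optional[Path] = None) -> Path:
    """Copy license_file into <home>/.datacleaner and return the new path."""
    home = home if home is not None else Path.home()
    target_dir = home / ".datacleaner"
    # Creating a dot-prefixed directory programmatically avoids the
    # Windows Explorer restriction mentioned above.
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(license_file, target_dir))


if __name__ == "__main__":
    # Demo in a temporary directory so the real home folder is not touched.
    with tempfile.TemporaryDirectory() as tmp:
        fake_home = Path(tmp)
        fake_license = fake_home / "example.lic"  # hypothetical file name
        fake_license.write_text("demo")
        print(install_license(fake_license, home=fake_home))
```

Calling `install_license(Path("path/to/your/license"))` without the `home` argument would copy the file into the current user's home folder.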
Connecting to your datastore
Below is a screenshot of the initial screen that is presented when launching DataCleaner (desktop community edition). A new datastore can be added in the "New job from scratch" or "Manage datastores" screens, which are available by clicking the buttons at the bottom of the screen.
File datastores can be added using a drop zone (or browse button) located at the top of the screen. Below it, there are buttons for adding databases or cloud services.
If the file is added using the drop zone, its format will be inferred. If you need more control over how the file is interpreted, use the alternative way to add a new datastore: the "Manage datastores" button on the welcome screen.
Apart from viewing and editing existing datastores, the "Datastore management" screen has an option to add a new one based on its type. Choose the icon at the bottom of the screen that suits your datastore type.
Once you've registered ('created') your own datastore, you can select it from the list (in the "New job from scratch" screen), or select it from the list and click "Build job" (in the "Datastore management" screen), to start working with it!
Tip
You can also configure your datastore by means of the configuration file (conf.xml), which has both pros and cons. For more information, read the configuration file chapter.
Adding components to the job
There are a few different kinds of components that you can add to your job:
- Analyzers, which are the most important components. Without at least one analyzer the job will not run (if you execute the job without one, DataCleaner will suggest adding a basic one that saves the output to a file). An analyzer is a component that inspects the data that it receives and generates a result or a report. Most of the data profiling functionality is implemented as analyzers.
- Transformers are components used to modify the data before analyzing it. Sometimes it's necessary to extract parts of a value or combine two values to correctly get an idea about a particular measure. In other scenarios, transformers can be used to perform reference data lookups or other similar tasks and place the results of an operation into the stream of data in the job.
  The result of a transformer is a set of output columns. These columns work exactly like regular columns in your job, except that they have a preceding step in the flow before they become materialized.
- Filters are components that split the flow of processing in a job. A filter has a number of possible outcomes and, depending on the outcome, a particular row might be processed by different sub-flows. Filters are often used simply to exclude certain rows from the analysis, e.g. null values or values outside the range of interest.
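To make the three roles more concrete, here is a small conceptual sketch. It does not use DataCleaner's API; the rows, column names and function names are all invented for illustration. It only mirrors the ideas above: a transformer adds output columns, a filter assigns each row an outcome, and an analyzer produces a result.

```python
from collections import Counter

# Hypothetical rows; the column names are made up for illustration.
rows = [
    {"name": "Ann Lee", "age": 34},
    {"name": "Bob Ray", "age": None},
    {"name": "Cy Dodd", "age": 51},
]


def first_name_transformer(row):
    """Transformer: adds an output column derived from an existing column."""
    return {**row, "first_name": row["name"].split(" ")[0]}


def null_check_filter(row):
    """Filter: assigns each row one outcome, here 'NULL' or 'NOT_NULL'."""
    return "NULL" if row["age"] is None else "NOT_NULL"


def value_count_analyzer(rows, column):
    """Analyzer: inspects the rows it receives and produces a result."""
    return Counter(row[column] for row in rows)


transformed = [first_name_transformer(r) for r in rows]
# Only rows with the 'NOT_NULL' outcome continue to the analyzer.
kept = [r for r in transformed if null_check_filter(r) == "NOT_NULL"]
result = value_count_analyzer(kept, "first_name")
print(result)
```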
Each of these components is presented as a node in the job graph. Double-clicking a component (graph node) will bring up its configuration dialog.
Transformers and filters are added to your job using the "Transform" and "Improve" menus. The menus are available in the component library on the left or by right-clicking on an empty space in the canvas. Please refer to the reference chapter Transformations for more information on specific transformers and filters.
Analyzers are added to your job using the "Analyze" menu (in most cases), or the "Write" menu for analyzers that save output to a datastore. Please refer to the reference chapter Analyzers for more information on specific analyzers.
Wiring components together
Simply adding a transformer or filter actually doesn't change your job as such! This is because these components only have an impact if you wire them together.
Transformer output
To wire a transformer, you simply need to draw an arrow between the components in the graph. You can start drawing it by right-clicking the first of the components and choosing "Link to..." from the context menu. An alternative way to enter drawing mode is to select the component and connect the components while holding down the Shift key.
Filter requirement
To wire a filter, you need to set up a dependency on one of its outcomes. All components have a button for selecting filter outcomes in the top-right corner of their configuration dialogs. Click this button to select a filter outcome to depend on.
If you have multiple filters, you can chain them simply by making each filter depend on an outcome of the previous one. A record is then passed to the component only if all filter requirements in the chain are met (AND logic).
Using "Link to...", it is also possible to wire several filters to a component in a kind of diamond shape. In that case, if any of the filter requirements are met, the record will be passed to the component (OR logic).
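The difference between chained filters (AND logic) and the diamond shape (OR logic) can be illustrated with a short sketch. This is plain Python rather than anything DataCleaner executes, and the two filter requirements and the records are hypothetical.

```python
def passes_chain(record, requirements):
    """Chained filters: every requirement must be met (AND logic)."""
    return all(req(record) for req in requirements)


def passes_diamond(record, requirements):
    """Filters wired in a diamond shape: any requirement suffices (OR logic)."""
    return any(req(record) for req in requirements)


# Hypothetical filter requirements.
def in_denmark(record):
    return record["country"] == "DK"


def is_adult(record):
    return record["age"] >= 18


records = [
    {"country": "DK", "age": 30},
    {"country": "DK", "age": 12},
    {"country": "US", "age": 40},
    {"country": "US", "age": 10},
]

requirements = [in_denmark, is_adult]
chain = [r for r in records if passes_chain(r, requirements)]
diamond = [r for r in records if passes_diamond(r, requirements)]
print(len(chain), len(diamond))
```

With these records, only the first one satisfies both requirements, while the first three satisfy at least one.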
Output data streams
The "Link to..." option wires components together in the "main flow". However, some components are able to produce additional output data streams. For example, the main feature of a Completeness Analyzer is to produce a summary of record completeness in the job result window. Additionally, it produces two output data streams: "Complete records" and "Incomplete records". Output data streams behave similarly to a source table, although such a table is created dynamically by a component. This enables further processing of such output.
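As a rough mental model of a component that produces both a result and output data streams, consider the sketch below. It is not the actual Completeness Analyzer: the function is invented for illustration, and the rule that a value is incomplete when it is None or an empty string is an assumption. Only the two stream names are taken from the text above.

```python
def completeness_analyzer(rows, columns):
    """Return a summary plus two output streams of complete and incomplete rows."""
    complete, incomplete = [], []
    for row in rows:
        # Assumption: a value counts as missing if it is None or an empty string.
        is_complete = all(row.get(c) not in (None, "") for c in columns)
        (complete if is_complete else incomplete).append(row)
    summary = {"total": len(rows), "incomplete": len(incomplete)}
    streams = {"Complete records": complete, "Incomplete records": incomplete}
    return summary, streams


# Hypothetical input rows.
rows = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": ""},
    {"name": None, "email": "cy@example.com"},
]

summary, streams = completeness_analyzer(rows, ["name", "email"])
print(summary)
# Each stream can be processed further, much like a source table.
for stream_name, stream_rows in streams.items():
    print(stream_name, len(stream_rows))
```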
Components producing output data streams have an additional "Link to..." entry in the right-click menu to wire the output to subsequent components.
Instead of wiring components with the "Link to..." menu option, you can double-click a component to bring up a configuration dialog in which its input columns can be chosen. In the top-right corner of the dialog, the scope of the component can be chosen. Switching between scopes makes it possible to choose input columns from the "main flow" (default scope) or from output data streams.
An example job using output data streams:
Tip
The canvas displays messages (at the bottom of the screen) with instructions on the next steps that need to be performed in order to build a valid job.
Executing jobs
When a job has been built, you can execute it. To check whether your job is correctly configured and ready to execute, check the status bar at the bottom of the job building window.
To execute the job, simply click the "Execute" button in the top-right corner of the window. This will bring up the result window, which contains:
- The Progress information tab, which contains useful information and progress indications while the job is being executed.
- Additional tabs for each component type that produces a result/report, for instance 'Value distribution' if such a component was added to the job.
Here's an example of an analysis result window:
Saving and opening jobs
You can save your jobs in order to reuse them at a later time. Saving a job is simple: just click the "Save" button in the top panel of the window.
Analysis jobs are saved in files with the ".analysis.xml" extension. These files are XML files that are readable and editable using any XML editor.
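Because job files are plain XML, they can also be inspected with a script. The sketch below makes no assumption about the real job schema: it simply counts elements by tag name. The sample document is a hypothetical stand-in, not an actual DataCleaner job.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_counts(xml_text: str) -> Counter:
    """Count elements by tag name, ignoring any XML namespace prefix."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())


# Hypothetical stand-in for the contents of a saved ".analysis.xml" file.
sample = """<job>
  <source><columns><column id="c1"/><column id="c2"/></columns></source>
  <analysis><analyzer/></analysis>
</job>"""

counts = element_counts(sample)
print(counts)
```

To inspect a real file, read its text first (for example with `Path("my_job.analysis.xml").read_text(encoding="utf-8")`, where the file name is a placeholder) and pass it to `element_counts`.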
Opening jobs can be done using the "Open" menu item. Opening a job will restore a job building window from where you can edit and run the job.
Template jobs
DataCleaner contains a feature where you can reuse a job for multiple datastores or just multiple columns in the same datastore. We call this feature 'template jobs'.
When opening a job, you are presented with a file chooser. When you select a job file, a panel will appear, containing some information about the job as well as the available actions:
If you click the 'Open as template' button you will be presented with a dialog where you can map the job's original columns to a new set of columns:
First you need to specify the datastore to use. On the left side you see the name of the original datastore, but the job is not restricted to use only this datastore. Select a datastore from the list and the fields below for the columns will become active.
Then you need to map individual columns. If you have two datastores that have the same column names, you can click the "Map automatically" button and they will be automatically assigned. Otherwise you need to map the columns from the new datastore's available columns.
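The idea behind automatic mapping can be sketched as follows. This is an illustration only: the function is hypothetical, and matching on exactly equal column names is an assumption about how "Map automatically" behaves. Columns without a same-named counterpart are left unmapped (None), which corresponds to the manual mapping step.

```python
def map_automatically(original_columns, new_columns):
    """Map each original column to a same-named new column, or None if absent."""
    available = set(new_columns)
    return {col: (col if col in available else None) for col in original_columns}


# Hypothetical column names.
original = ["id", "name", "email"]
new = ["id", "name", "mail_address"]

mapping = map_automatically(original, new)
unmapped = [col for col, target in mapping.items() if target is None]
print(mapping)
print("Still to map manually:", unmapped)
```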
Finally, your job may contain 'Job-level variables'. These are configurable properties of the job that you might also want to fill out.
Once these 2-3 steps have been completed, click the "Open job" button, and DataCleaner will be ready for executing the job on a new set of columns!
Writing cleansed data to files
Although the primary focus of DataCleaner is analysis, often during such analysis you will find yourself actually improving data by applying transformers and filters to it. When this is the case, you will want to export the improved/cleansed data so you can use it in situations other than the analysis.
Please refer to the reference chapter Writers for more information on writing cleansed data.