Fuzzy Lookup & Fuzzy Grouping

Both of these Data Flow transformations are used for data cleansing.

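Fuzzy Lookup matches input values against similar values in another database table or index, called the reference. It is very useful for standardizing customer data and for looking up values that may contain errors, for example attribute columns such as an address or a city name.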

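Fuzzy Lookup outputs not only the matched value but also two extra columns, similarity and confidence. Similarity is a floating-point value between 0 and 1 that expresses how alike the two values in a matched pair are. For example, the similarity between Jerry Chan and Jerry Chen might be 0.89. As for confidence, a higher value means there are fewer alternative candidate matches to choose from.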
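To get a feel for what a 0 to 1 similarity score looks like, here is a minimal sketch in Python. It uses the standard library's difflib ratio as a stand-in metric. This is an assumption for illustration only: it is not the algorithm SSIS uses, so the numbers will not match what Fuzzy Lookup reports. The helper name `similarity` is made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (difflib ratio, NOT the SSIS algorithm)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


score = similarity("Jerry Chan", "Jerry Chen")
print(round(score, 2))  # 9 of 10 characters line up, so this prints 0.9
```

Identical strings score 1.0, and the score drops as more characters differ.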

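Fuzzy Lookup offers four ways to configure the reference table:

1) Generate New Index: builds a temporary index on the reference columns of the reference table, uses it for matching, and deletes it once the task completes;

2) Generate New Index + Store New Index: builds the index and saves it in the database so that it can be reused later;

3) Generate New Index + Store New Index + Maintain Stored Index: checking Maintain Stored Index creates triggers on the reference table that capture changes and keep the newly built index in sync;

4) Use Existing Index: picks a previously stored index from the database as the reference index.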
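The difference between the four options is easier to see with a toy model. The sketch below is purely conceptual: `MatchIndex` is a hypothetical class invented for this example, a plain in-memory token index, and it says nothing about how SSIS actually stores or maintains its index.

```python
from collections import defaultdict


class MatchIndex:
    """Hypothetical token index over a reference table (row id -> text)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of row ids

    def add(self, row_id, text):
        for token in text.lower().split():
            self.postings[token].add(row_id)

    def remove(self, row_id, text):
        for token in text.lower().split():
            self.postings[token].discard(row_id)

    @classmethod
    def build(cls, table):
        index = cls()
        for row_id, text in table.items():
            index.add(row_id, text)
        return index


reference = {1: "New York", 2: "Newark"}

# Options 1 and 2: build the index from the reference table
# (option 2 would additionally keep it around for later runs).
index = MatchIndex.build(reference)

# Option 3: a change to the reference table is mirrored into the index,
# which is the role the triggers play.
reference[3] = "New Orleans"
index.add(3, "New Orleans")

# Option 4: a later run reuses `index` as it is instead of rebuilding it.
print(sorted(index.postings["new"]))  # [1, 3]
```

The trade-off this hints at: rebuilding on every run is simple but repeats work, while a stored and maintained index saves the rebuild at the cost of extra work on every change to the reference table.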

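On the Advanced page:

Maximum number of matches to output per lookup: caps the number of matches returned for each input value;

Similarity Threshold: the minimum similarity a candidate must reach to be returned as a match;

Token Delimiters: similar to Token Delimiters in Data Profiling. The input column value is split into tokens on the given delimiters, and those tokens are what the fuzzy lookup then works with.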
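The three Advanced settings above can be sketched as parameters of a small lookup function. This is a rough illustration under stated assumptions: `tokenize`, `token_similarity` and `fuzzy_lookup` are hypothetical helpers, the default delimiter set is arbitrary, and difflib again stands in for the real SSIS matching logic.

```python
import re
from difflib import SequenceMatcher


def tokenize(value, delimiters=" ,.-"):
    """Split a value into lowercase tokens on any of the delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [t for t in re.split(pattern, value.lower()) if t]


def token_similarity(a, b, delimiters=" ,.-"):
    """Compare the normalized token sequences with difflib (a stand-in metric)."""
    ta = " ".join(tokenize(a, delimiters))
    tb = " ".join(tokenize(b, delimiters))
    return SequenceMatcher(None, ta, tb).ratio()


def fuzzy_lookup(value, reference, max_matches=1, threshold=0.0, delimiters=" ,.-"):
    """Return up to max_matches (reference_value, similarity) pairs at or above threshold."""
    scored = [(ref, token_similarity(value, ref, delimiters)) for ref in reference]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:max_matches]


cities = ["New York", "Newark", "New Orleans"]
for ref, score in fuzzy_lookup("New-Yrok", cities, max_matches=2, threshold=0.6):
    print(ref, round(score, 3))
```

Raising `threshold` drops weak candidates, while `max_matches` only limits how many of the surviving candidates are returned.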

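Two settings in the property list deserve attention:

Exhaustive: when set to True, every input record is compared against all records in the reference table during the lookup. The results are of course more accurate, but the performance cost is high when the reference table is large. The default is False;

WarmCaches: when set to True, the reference table and the index are loaded into memory ahead of time.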
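The cost and accuracy trade-off behind Exhaustive can be sketched as follows. Everything here is an assumption made for illustration: the index prunes candidates by exact shared tokens, which is much cruder than what SSIS does, and the function names are hypothetical. Building the index before any lookup runs is only loosely analogous to WarmCaches.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_token_index(reference):
    """Map each lowercase token to the positions of reference rows containing it."""
    index = defaultdict(set)
    for pos, ref in enumerate(reference):
        for token in ref.lower().split():
            index[token].add(pos)
    return index


def candidates(value, reference, index, exhaustive=False):
    """Return the reference rows to score for one input value."""
    if exhaustive:
        return list(reference)  # compare against every reference row
    hits = set()
    for token in value.lower().split():
        hits |= index.get(token, set())
    return [reference[pos] for pos in sorted(hits)]


def best_match(value, reference, index, exhaustive=False):
    """Return (score, reference_value) for the best candidate, or (0.0, None)."""
    pool = candidates(value, reference, index, exhaustive)
    scored = [(SequenceMatcher(None, value.lower(), ref.lower()).ratio(), ref)
              for ref in pool]
    return max(scored, default=(0.0, None))


reference = ["New York", "Newark", "Boston"]
index = build_token_index(reference)  # built up front, before any lookup runs

print(best_match("Nwe York", reference, index))               # scores only rows sharing a token
print(best_match("Bostn", reference, index))                  # no shared token, so no candidate
print(best_match("Bostn", reference, index, exhaustive=True)) # scans all rows and finds Boston
```

The misspelled "Bostn" shows the trade-off: the pruned lookup is cheap but misses it, while the exhaustive scan finds it at the cost of scoring every reference row.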

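Fuzzy Grouping is similar to Fuzzy Lookup: based on the similarity level you specify, it returns a cleansed value for one or more columns (the grouping field). In effect, that value is what groups similar records together.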
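The grouping idea can be sketched with a simple greedy pass. This is an illustrative assumption, not the SSIS implementation: `fuzzy_group` is a hypothetical helper, the first value seen in each group is treated as its cleansed (canonical) value, and difflib is again the stand-in similarity metric.

```python
from difflib import SequenceMatcher


def fuzzy_group(values, threshold=0.8):
    """Greedily put each value into the first group whose canonical value is similar enough."""
    groups = []  # list of (canonical_value, members)
    for value in values:
        for canonical, members in groups:
            ratio = SequenceMatcher(None, value.lower(), canonical.lower()).ratio()
            if ratio >= threshold:
                members.append(value)
                break
        else:
            # No existing group is close enough, so this value starts a new one.
            groups.append((value, [value]))
    return groups


names = ["Jerry Chen", "Jerry Chan", "Jery Chen", "Mary Smith"]
for canonical, members in fuzzy_group(names):
    print(canonical, "<-", members)
```

With these inputs the three Jerry variants fall into one group under "Jerry Chen", and "Mary Smith" stays on its own. A lower threshold merges more aggressively and risks grouping records that are not really duplicates.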

Fuzzy Lookup Transformation: Capable of joining to external data based on data similarity,
the Fuzzy Lookup Transformation is a core data cleansing tool in SSIS. This transformation
is perfect if you have dirty data input that you want to associate to data in a table in your
database based on similar values. Later in the chapter, you’ll take a look at the details of the
Fuzzy Lookup Transformation and what happens behind the scenes.

Fuzzy Grouping Transformation: The main purpose is de-duplication of similar data. The
Fuzzy Grouping Transformation is ideal if you have data from a single source and you know
you have duplicates that you need to find.

posted @ 2015-06-13 16:54 Jerry_Chen
