
Counting Rows in SQL Server Integration Services

I am often asked if there is a way to capture row counts in SSIS. For example, a user may want to know how many rows passed along certain outputs of a conditional split, in order to compare the ratio of, say, high- and low-value line items in a day's sales.

 

Integration Services provides a great way to do this, using the Row Count component. To use this component, first create a variable of integer type (the default, so that's easy) at a scope where you can see it from your Data Flow task.

 

Now add your Row Count component to the Data Flow at the point in the process where you would like to count rows. Edit the Row Count component and set its VariableName property to the name of the variable you created.

 

When you execute the Data Flow, the number of rows which pass through the Row Count component is written to the named variable. However, it's important to note that the variable value does not change until the Data Flow has completed. This is true for all SSIS package variables referenced in the Data Flow: even when using the Script component, the values are locked when execution of the Data Flow starts and they are only updated at the end. (VB.NET variables within the Script component can be changed during the flow, but they cannot be used outside the script.)

 

So what can you do with this row count?

 

One use I like is to populate a table which captures detailed package execution statistics. I create a number of variables, such as VarErrorRows and VarGoodRows, and I use an Execute SQL task to write to a table with a statement such as:

 

INSERT INTO [My_Audit_Table]
    ( [Package name]
     ,[Machine name]
     ,[Username]
     ,[ErrorRows]
     ,[GoodRows]
     ,[Execution start time]
     ,[AcceptancePercent])
VALUES (?, ?, ?, ?, ?, ?, ?)

 

And I map the variables appropriately. Note that I can use system variables to map Package name, Machine name, User name and Execution start time.
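
If you don't already have such an audit table, here is a minimal sketch of one that matches the statement above (the table name, column names and data types are just the ones assumed for this example; adjust them to suit your environment):

CREATE TABLE [My_Audit_Table]
(
     [Package name]          nvarchar(200) NOT NULL
    ,[Machine name]          nvarchar(100) NOT NULL
    ,[Username]              nvarchar(100) NOT NULL
    ,[ErrorRows]             int           NOT NULL
    ,[GoodRows]              int           NOT NULL
    ,[Execution start time]  datetime      NOT NULL
    ,[AcceptancePercent]     decimal(5, 2) NULL
)

The parameters of the Execute SQL task are then mapped, in order, to system variables such as System::PackageName, System::MachineName, System::UserName and System::StartTime, and to my own row-count variables.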

 

Another interesting use is to audit a sample of the data in a process before loading the entire data set. For example, let's say I have an extract which is pretty expensive - an involved query, or pulling a flat file over a slow network connection from a remote log server. The data may be OK, but on the other hand it may have various quality issues that would prevent me from loading it to my warehouse. But it's expensive data to get to, so I want to audit its quality and, if it passes, load it in a single operation.

 

I work this scenario using two Data Flows - the first to extract and audit, and the second to load.

 

In this case, I add a Multicast immediately after the source adapter. One leg of the multicast goes straight to a Raw File destination, so I capture the source data on my integration server. Another leg of the multicast goes through a Row Sampling or Percentage Sampling component to sample, say, 10% of the load. I immediately count the sampled rows into a variable, SampleSize. Next I apply whatever auditing logic I like: conditionally splitting out null keys, columns which have missing values, and so on. At the end of my auditing I add another Row Count component to capture, in GoodRows, the number of rows which passed the audit.

 

The second Data Flow is my real business logic to load the warehouse, but this flow sources from the Raw File rather than the original source - no need to stress that slow connection again, or to run that query twice.

 

The important thing is to add an expression to the precedence constraint between the two Data Flows. The second Data Flow only runs if the first succeeds and the expression GoodRows / SampleSize >= x evaluates to True. x can be whatever value you like, or could even be another variable, configured from an XML configuration file if you like.
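
A sketch of that precedence-constraint expression, assuming both variables live in the User namespace and using 0.95 as an example threshold (the DT_R8 casts avoid integer division truncating the ratio to zero):

(DT_R8)@[User::GoodRows] / (DT_R8)@[User::SampleSize] >= 0.95

Set the constraint's evaluation operation to "Expression and Constraint" so that it also requires the first Data Flow to succeed.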

 

And of course you can capture all these metrics into an audit table just as before.

 

However, the Row Count component is not the answer to all row-counting scenarios. One common question is: how many rows were inserted into a destination? You may be tempted to count these with a Row Count component just before the destination component. This may work for a flat file, but not reliably for a database, because it would not take into account rows which failed to be inserted. So there are two patterns you could use for OLE DB destinations:

- Use an Execute SQL task to count rows before and after the Data Flow has executed, and compare the two counts (see the sketch after this list);

- Add a Row Count component before the destination, and another on the error output of the destination, and compare the values of the two variables after the Data Flow has completed.
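
For the first pattern, the two Execute SQL tasks can be as simple as the queries below, each storing its single-row result into a package variable (My_Destination_Table and the column aliases are just placeholders for this sketch):

-- Run before the Data Flow: capture the starting row count
SELECT COUNT(*) AS RowCountBefore FROM [My_Destination_Table];

-- Run after the Data Flow: capture the ending row count
SELECT COUNT(*) AS RowCountAfter FROM [My_Destination_Table];

-- Rows actually inserted = RowCountAfter - RowCountBefore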

 

But let's not end on a negative. I love the Row Count component, so here's another neat use. The Row Count component can be used without an output. In other words, it can, in effect, be a destination. This is as cool as a very cool thing indeed, because it means you can run a Data Flow and debug it without the data going anywhere. Think of it - no more temp tables, or dummy text files, just to get your process working while you debug. Develop first using a Row Count destination and then, when you're happy with the process, hook up a real destination and you're ready to go - after some final testing, of course.

 

I hope this gives some insight into this elegantly simple, often overlooked, but very valuable little component.

 

posted on 2005-04-13 11:23 by 董晓涛