Counting Rows in SQL Server Integration Services
I am often asked if there is a way to capture row counts in SSIS. For
example, a user may want to know how many rows passed along certain outputs of
a conditional split, in order to compare the ratio of, say, high- and low-value
line items in a day's load.
Integration Services provides a great way to do this, using the Row
Count component. To use this component, first create a variable of integer type
(the default, so that's easy).
Now add your Row Count component to the Data Flow at the point in the
process where you would like to count rows. Edit the Row Count component and
set its VariableName property to the name of the variable you created.
When you execute the Data Flow, the number of rows which pass through
the Row Count component is written to the named variable. However, it's worth
knowing that the variable is only updated after the last row has passed through,
so you cannot read a running count while the Data Flow is still executing.
So what can you do with this row count?
One use I like is to populate a table which captures detailed package
execution statistics. I create a number of variables, such as VarErrorRows and
VarGoodRows, and use an Execute SQL task to write to a table with a statement
such as:
INSERT INTO [My_Audit_Table]
(
    [Package name]
    ,[Machine name]
    ,[Username]
    ,[ErrorRows]
    ,[GoodRows]
    ,[Execution start time]
    ,[AcceptancePercent]
)
VALUES (?, ?, ?, ?, ?, ?, ?)
And I map the variables appropriately. Note that I can use system
variables to map Package name, Machine name, User name and Execution start
time.
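For illustration, the parameter mapping on that Execute SQL task would pair each ? placeholder, in order, with a variable. The system variables below are standard SSIS ones; VarAcceptancePercent is a hypothetical variable I use here for the final column:

Index   Variable                      Data Type
0       System::PackageName           NVARCHAR
1       System::MachineName           NVARCHAR
2       System::UserName              NVARCHAR
3       User::VarErrorRows            LONG
4       User::VarGoodRows             LONG
5       System::StartTime             DATE
6       User::VarAcceptancePercent    DOUBLE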
Another interesting use is to audit a sample of the data in a process
before loading the entire data set. For example, let's say I want to check the
quality of a sample of a large extract before committing to the full load.
I work this scenario using two Data Flows - the first to extract and
audit, and the second to load.
In this case, I add a multicast immediately after the source adapter.
One leg of the multicast goes straight to a raw file destination - so I capture
the source to my integration server. Another leg of the multicast goes through
a Row Sampling or Percentage Sampling component to sample, say, 10% of the load.
I immediately count the sampled rows into a variable, SampleSize. Next I apply whatever
auditing logic I like - conditionally splitting out null keys, columns which
have missing values, etc. At the end of my auditing I can add another Row Count
component to capture GoodRows which passed the audit.
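As a sketch of that auditing logic, a Conditional Split condition along the following lines would route suspect rows away from the good path (the column names here are purely hypothetical):

ISNULL([CustomerKey]) || LTRIM(RTRIM([ProductCode])) == ""

Rows matching the condition are diverted (and can be counted with their own Row Count component), while everything else flows on to be counted as GoodRows.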
The second Data Flow is my real business logic to load the warehouse,
but this flow sources from the Raw File rather than the original source - no
need to stress that slow connection again, or to run that query twice.
The important thing is to add an expression to the precedence
constraint between the two Data Flows. The second Data Flow only runs if the
first succeeds and the expression GoodRows / SampleSize >= x evaluates to True.
Here x can be whatever value you like, or could even be another variable,
configured from an XML file if you like.
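One caveat: if GoodRows and SampleSize are both integer variables, the division in an SSIS expression is integer division, so the actual constraint expression should cast to float first. A sketch, with an example threshold of 0.95:

(DT_R8)@[User::GoodRows] / (DT_R8)@[User::SampleSize] >= 0.95

Set the precedence constraint's evaluation operation to Expression and Constraint, so the second Data Flow also requires the first to succeed.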
And of course you can capture all these metrics into an audit table
just as before.
However, the Row Count is not the answer to all row counting
scenarios. One common question is: how many rows were inserted into a destination?
You may be tempted to count these with a Row Count component just before the destination
component. This may work for a flat file, but not reliably for a database,
because it would not take into account rows which failed to be inserted. So
there are two patterns you could use for OLE DB destinations:
- Use an Execute SQL task to count rows before and after the Data Flow has
executed, and compare;
- Add a Row Count component before the destination, and one on the error output
of the destination, and compare the values of the two variables after the Data
Flow has completed.
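As a sketch of the first pattern (the table and variable names are illustrative), each Execute SQL task would run a statement like:

SELECT COUNT(*) AS RowsInTable FROM dbo.My_Target_Table

mapping RowsInTable to, say, User::PreLoadCount before the Data Flow and User::PostLoadCount after it; the number of inserted rows is then PostLoadCount - PreLoadCount. Of course, this is only reliable if no other process writes to the table during the load.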
But let's leave a deeper comparison of those two patterns for another day.
I hope this gives some insight into this elegantly simple, often
overlooked, but very valuable little component.