T-SQL User-Defined Functions: the good, the bad, and the ugly (part 1)
So you thought that encapsulating code in user-defined functions for easy reuse is a good idea? Think again!
SQL Server supports three types of user-defined functions. Only one of them qualifies as good. The other two – well, the title says it all, doesn’t it?
The bad: scalar functions
A scalar user-defined function (UDF) is very much like a stored procedure, except that it always returns a single value of a predefined data type – and because of that property, it isn’t invoked with an EXECUTE statement, but embedded in an expression where the returned value is immediately used. I won’t explain all the basics; I assume that you are either already familiar with the concept, or that you have at least read the description in Books Online. A scalar UDF is allowed to read (but not modify!) table data, but in this blog post I will focus on scalar UDFs that consist of computations and other expressions only, without doing any data access.
The code below defines and then uses a very simple scalar UDF that triples its input:
CREATE FUNCTION dbo.Triple(@Input int)
RETURNS int
AS
BEGIN;
DECLARE @Result int;
SET @Result = @Input * 3;
RETURN @Result;
END;
go
SELECT DataVal, dbo.Triple(DataVal) AS Triple
FROM dbo.LargeTable;
This example is obviously overly simple – even the most enthusiastic devotee of isolating and reusing code will never bother to define and use a function for something so trivial. But if the calculation in the function is actually very complex, it’s easy to see how code that defines the calculation once and then simply invokes the function every time it’s needed is easier to build, understand, debug, and maintain than code that repeats the complex expression at several locations. This is exactly why, in traditional programming languages like C# or VB.Net, using functions to encapsulate and reuse common computations is considered a best practice.
But SQL Server isn’t a traditional programming language. It’s a declarative language, with an optimizer that has been designed to optimize the execution order within queries to make sure that the results are returned as fast as possible – without impacting correctness, of course. And that optimizer simply cannot optimize code that calls scalar UDFs as well as it can optimize code that has the same logic inlined. Let’s first take a look – in order to test the performance, I’ll create a table with 100,000 rows with random values from 1 to 10 in the DataVal column.
CREATE TABLE dbo.LargeTable
(KeyVal int NOT NULL PRIMARY KEY,
DataVal int NOT NULL CHECK (DataVal BETWEEN 1 AND 10)
);
WITH Digits
AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
INSERT INTO dbo.LargeTable (KeyVal, DataVal)
SELECT 10000 * tt.d + 1000 * st.d
+ 100 * h.d + 10 * t.d + s.d + 1,
10 * RAND(CHECKSUM(NEWID())) + 1
FROM Digits AS s, Digits AS t, Digits AS h,
Digits AS st, Digits AS tt;
The code may be a bit complex and you may be tempted to write a simple loop to insert 100,000 rows. But that would take a lot more time – the code above runs in less than 1 second on my laptop, whereas a loop takes almost five seconds. When we need more rows (later), this difference becomes even more noticeable.
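For comparison, here is a sketch of the loop-based alternative I am referring to (hypothetical code, shown only to illustrate the slower approach; each singleton INSERT runs in its own auto-commit transaction, which is where much of the extra time goes):
DECLARE @i int = 1;
WHILE @i <= 100000
BEGIN;
    INSERT INTO dbo.LargeTable (KeyVal, DataVal)
    VALUES (@i, 10 * RAND(CHECKSUM(NEWID())) + 1);
    SET @i = @i + 1;
END;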
The first test
Now it’s time to test. Below are two queries. The first query is designed to calculate the triple of each DataVal value in the table. (Note that I added the MAX aggregate to ensure that the actual performance would not be impacted by the overhead of returning 100,000 rows to SSMS and rendering them in the result display). The second query is exactly the same, except that it doesn’t use the scalar UDF, but includes (“inlines”) the actual formula in the query.
SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
FROM dbo.LargeTable;
SELECT MAX(3 * DataVal) AS MaxTriple
FROM dbo.LargeTable;
If you select the two queries, activate the option to include the actual execution plan, hit execute, and then switch to the execution plan tab, you may be pretty happy: the two plans look almost identical, and the query with the UDF is even reported to be slightly cheaper (49% of the batch versus 51%).
I must add that I didn’t get these plans all the time. In my first tests, the plans were equal, the only difference being the actual content (visible only in the Defined Values property) of the Compute Scalar iterator – the arithmetic formula 3 * DataVal versus invoking dbo.Triple; in those cases, both plans were said to have a cost of 50%. In later tests, the plans changed to the above: the call to dbo.Triple is now hidden several levels deep in the Defined Values property of the Stream Aggregate iterator, and though the same work is still done, the first query is suddenly said to be slightly cheaper than the second. But either way, whether 50% each or 49% versus 51%, the scalar UDF seems to be a very good choice.
However, you may not be aware that the “Actual Execution Plan” is a dirty rotten liar. Or maybe I should say that the terms “Actual Execution Plan” and “Estimated Execution Plan” are misleading. There is only one execution plan, it gets created when the queries are compiled, and then the queries are executed. The only difference between the “Actual” and the “Estimated” execution plan is that the estimated plan only tells you the estimates for how many rows flow between iterators and how often iterators are executed, and the actual plan adds the actual data for that. But no “actual operator cost” or “actual subtree cost” is added to the corresponding estimated values – and since those costs are the values that the percentages are based on, the percentages displayed in an actual execution plan are still based only on the estimates.
To get a better idea of the actual query cost, let’s run these queries again – this time without the execution plan, but with the actual duration displayed. You can enable this display with the option Query / Query Options / Advanced / SET STATISTICS TIME. Or you can simply add the statement SET STATISTICS TIME ON; at the start of the batch (and SET STATISTICS TIME OFF; at the end). Now, if you run the queries, you’ll get some extra information (on the Messages tab of the results pane) that tells you exactly how long each query took (elapsed time) and how much CPU time it used (CPU time). On my laptop, the query that uses the UDF takes 889 ms CPU and 900 ms elapsed, and the query with the logic inlined takes only a fraction of that: 47 ms CPU and 52 ms elapsed! Not 49% versus 51%, but 95% versus 5%.
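For reference, the complete test batch with timing enabled looks like this:
SET STATISTICS TIME ON;
SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
FROM dbo.LargeTable;
SELECT MAX(3 * DataVal) AS MaxTriple
FROM dbo.LargeTable;
SET STATISTICS TIME OFF;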
Why?
This huge performance difference is caused by the overhead of calling and executing a scalar UDF. The computation of 3 * DataVal in the second query is entirely executed inside an iterator (the Compute Scalar), which is very efficient. The computation of dbo.Triple(DataVal) in the first query is also executed in an iterator (the Stream Aggregate, in this case) – but since this is a call to a separate module, SQL Server will have to invoke this module for each of the 100,000 rows flowing through that iterator. Each of those 100,000 calls introduces some overhead: start up the execution, step through the two executable statements of the function body, and clean up afterwards. Some sources claim that the function text is interpreted (compiled) on each call; I found that this is – at least on SQL Server 2012 – not the case: when executing this code with a profiler trace running, only a single cache hit (or cache insert if the function is not in the procedure cache) event is fired.
This overhead is invisible in the “Actual Execution Plan”, but the execution time and the profiler trace tell a different story. And so does the “Estimated Execution Plan” – if I select the query that uses the function and then request an “Estimated Execution Plan”, I get two execution plans: one for the query that we saw before, and one for the function body, with two iterators that represent the executable statements: a SET with a calculation, and a RETURN.
But note that, again, the execution plan is lying. First, it implies that the UDF is invoked only once, which is not the case. Second, look at the cost. You may think that the 0% cost displayed for the function’s plan is the effect of rounding down, since a single execution of the function costs so little in relation to the cost of accessing and aggregating 100,000 rows. But if you check the properties of the iterators of the plan for the function, you’ll see that all operator and subtree costs are actually estimated to be exactly 0. This lie is maybe the worst of all – because it’s not just the plan lying to us, it is SQL Server lying to itself. This cost estimate of 0 is actually used by the query optimizer, so all plans it produces are based on the assumption that executing the function is free. As a result, the optimizer will not even consider optimizations it might use if it knew how costly calling a scalar UDF actually is.
Determinism
You may have wondered why SQL Server even bothers to invoke the UDF 100,000 times. Based on the CHECK constraint, there can never be more than 10 distinct values in the DataVal column – so why doesn’t the optimizer transform the execution plan to first get the distinct values of DataVal, then call the UDF only for those? Surely, that would be more efficient? Yes, it would, as we can easily verify by making that transformation ourselves:
SET STATISTICS TIME ON;
SELECT MAX(dbo.Triple(DataVal)) AS MaxTriple
FROM (SELECT DISTINCT DataVal
FROM dbo.LargeTable) AS d;
SELECT MAX(3 * DataVal) AS MaxTriple
FROM (SELECT DISTINCT DataVal
FROM dbo.LargeTable) AS d;
SET STATISTICS TIME OFF;
If you check the returned execution times, you will see that this technique even helps the performance of the query without function, if only by a little bit – 47 ms CPU and 50 ms elapsed on my laptop. For the version with scalar UDF, the saving is significant, as it is now almost as efficient as the version without scalar UDF: 62 ms CPU and 51 ms elapsed.
So why does the optimizer not make this transformation by itself? There are two reasons for that. The first is that with this version of the UDF, it can’t guarantee that this transformation won’t change the results, because of a property called “determinism”. If a function is deterministic, we can be sure that when it is invoked multiple times with the same arguments, it will always return the same result. If a function is not deterministic, it might return different results, even when the same parameters are passed in. Our scalar UDF is not deterministic, as this query shows:
SELECT OBJECTPROPERTY(OBJECT_ID('dbo.Triple'), 'IsDeterministic');
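-- Returns 0: the function is not marked deterministic.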
You can check Books Online for a list of all the requirements a function has to meet to be deterministic. In our case, the only problem is that the UDF is not schemabound, so let’s remedy that:
ALTER FUNCTION dbo.Triple(@Input int)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN;
DECLARE @Result int;
SET @Result = @Input * 3;
RETURN @Result;
END;
go
SELECT OBJECTPROPERTY(OBJECT_ID('dbo.Triple'), 'IsDeterministic');
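-- Now returns 1: the schemabound function is marked deterministic.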
The function is now marked as deterministic – but if you rerun the previous tests, you’ll see that this does not affect plan choice for these queries at all! The optimizer still won’t shuffle the plan of the first query to match that of the second, even though they are now (with the function marked deterministic) guaranteed to be equivalent. That is because there is a second reason why the optimizer won’t make this change – and that is that the optimizer thinks that invoking the function has a zero cost. Why would it even consider a complicated plan transform that saves 99,990 function calls if it thinks that those calls are free? After all, zero multiplied by 99,990 is still zero. Unfortunately, whereas we can affect determinism of a function, we cannot influence the cost estimate the optimizer uses for it.
This same zero cost estimate leads to more bad plan choices. For instance, in the query below, the optimizer will happily invoke the scalar UDF two times for each row: once for the WHERE and once for the SELECT:
SELECT 1 - dbo.Triple(DataVal)
FROM dbo.LargeTable
WHERE dbo.Triple(DataVal) > 20;
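For comparison, the inlined equivalent (the remedy discussed later in this post) avoids the per-call overhead completely, even though the expression appears twice:
SELECT 1 - (3 * DataVal)
FROM dbo.LargeTable
WHERE 3 * DataVal > 20;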
It gets worse
Unfortunately, these two problems (per-call overhead multiplied by the number of rows, and a bad cost estimate affecting plan choice) are not the only ones with scalar UDFs. There is a third problem: SQL Server will never use parallelism in a plan that uses scalar UDFs. This only becomes visible with larger tables, so let’s get rid of our 100,000 test rows and replace them with ten million fresh ones:
TRUNCATE TABLE dbo.LargeTable;
WITH Digits
AS (SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS d(d))
INSERT INTO dbo.LargeTable (KeyVal, DataVal)
SELECT 1000000 * sm.d
+ 100000 * ht.d + 10000 * tt.d + 1000 * st.d
+ 100 * h.d + 10 * t.d + s.d + 1,
10 * RAND(CHECKSUM(NEWID())) + 1
FROM Digits AS s, Digits AS t, Digits AS h,
Digits AS st, Digits AS tt, Digits AS ht,
Digits AS sm;
If we now execute our original queries again, we will see two changes compared to the first test with 100,000 rows. The first change is that the plans are no longer the same; the plan for the query with scalar UDF is still the same, but the plan for the query without scalar UDF now introduces parallelism.
The second change is maybe not really a change – it’s the observation that the percentages in the plan are still way off. On my laptop, the query with UDF takes 40108 ms CPU and 43760 ms elapsed to process all ten million rows; the query without UDF does the same in 4397 ms CPU and 808 ms elapsed. Based on CPU usage, the UDF version takes 90% of the batch (making the difference slightly less than in the non-parallel version – this is caused by the overhead of synchronizing over all the threads and combining the results); based on elapsed time, it’s even 98% (because in the parallel plan, all cores in my laptop are working instead of just one).
I made another interesting (and slightly disturbing) observation when I looked at the execution plans of the queries that force the optimizer to first find the distinct values of DataVal and then compute the triple of only those distinct values:
The version without UDF uses the parallel part of the plan to read all rows and find the distinct DataVal values, then computes the triple for those distinct values in the serial part. I would have expected a similar plan for the version with UDF (since the UDF would only be called in the serial part), but apparently, the mere presence of a scalar UDF in the query prevents any form of parallelism for the entire execution plan!
The remedy
If you care about performance, you should avoid the use of scalar UDFs, except in situations where their performance hit doesn’t hurt. For instance, in a SET statement without a subquery, a UDF does not hurt you, because it will be invoked only once. And in a SELECT statement that processes only a very small table and is not part of a time-critical part of your system, the performance hit doesn’t really matter much. But don’t think that it’s safe to use a scalar UDF in a query that returns only a few rows from a large table – sometimes the optimizer will produce a plan where the evaluation of the UDF is pushed down to a part of the plan that is executed before the majority of the rows is filtered out!
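A minimal sketch of the harmless SET case, with the UDF invoked exactly once (the variable name and argument are made up for illustration):
DECLARE @Tripled int;
SET @Tripled = dbo.Triple(14);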
The obvious workaround is to not use a scalar UDF at all, but instead inline the code. For the Triple function I used here, this is dead simple. But if you have a UDF that contains multiple statements, calculating and storing intermediate results in variables, doing conditional logic based on IF … ELSE blocks, etc., this can be quite hard. You may have to use complicated CASE expressions, and you may have to repeat expressions multiple times (or use CTEs to avoid that duplication). But the performance gain will make up for the effort! Just don’t forget to carefully comment and document the sometimes hideous queries this may result in. As an example of what I mean, look at this scalar UDF and the corresponding inline rewrite (and if you want to know what the use of this UDF is, there is none; it’s just some nonsense I made up).
CREATE FUNCTION dbo.Nonsense(@Input int)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN;
DECLARE @Result int,
@BaseDate date,
@YearsAdded date,
@WeekDiff int;
SET @BaseDate = '20000101';
SET @YearsAdded = DATEADD(year, @Input, @BaseDate);
IF @Input % 2 = 0
BEGIN;
SET @Result = DATEDIFF(day, @YearsAdded, @BaseDate)
- DATEDIFF(month, @YearsAdded, @BaseDate);
END;
ELSE
BEGIN;
SET @WeekDiff = DATEDIFF(week, @BaseDate, @YearsAdded);
SET @Result = (100 + @WeekDiff) * (@WeekDiff - 100);
END;
RETURN @Result;
END;
go
SELECT KeyVal, DataVal,
dbo.Nonsense(DataVal)
FROM dbo.LargeTable
WHERE KeyVal <= 100;
WITH MyCTE
AS (SELECT KeyVal, DataVal,
CAST('20000101' AS date) AS BaseDate,
DATEADD(year, DataVal, CAST('20000101' AS date)) AS YearsAdded
FROM dbo.LargeTable)
SELECT KeyVal, DataVal,
CASE WHEN DataVal % 2 = 0
THEN DATEDIFF(day, YearsAdded, BaseDate)
- DATEDIFF(month, YearsAdded, BaseDate)
ELSE (100 + DATEDIFF(week, BaseDate, YearsAdded))
* (DATEDIFF(week, BaseDate, YearsAdded) - 100)
END
FROM MyCTE
WHERE KeyVal <= 100;
The query with the logic inlined may not look pretty – but if I execute the query for all ten million rows in the table (again using the MAX aggregate to reduce I/O and rendering overhead), the performance difference makes it worth it: over a minute for the UDF version, versus less than 4 seconds for the inlined version!
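For reference, the timed test looked roughly like this (a sketch: the MAX versions of the two queries above, without the KeyVal filter):
SET STATISTICS TIME ON;
SELECT MAX(dbo.Nonsense(DataVal)) AS MaxNonsense
FROM dbo.LargeTable;
WITH MyCTE
AS (SELECT DataVal,
           CAST('20000101' AS date) AS BaseDate,
           DATEADD(year, DataVal, CAST('20000101' AS date)) AS YearsAdded
    FROM dbo.LargeTable)
SELECT MAX(CASE WHEN DataVal % 2 = 0
                THEN DATEDIFF(day, YearsAdded, BaseDate)
                   - DATEDIFF(month, YearsAdded, BaseDate)
                ELSE (100 + DATEDIFF(week, BaseDate, YearsAdded))
                   * (DATEDIFF(week, BaseDate, YearsAdded) - 100)
           END) AS MaxNonsense
FROM MyCTE;
SET STATISTICS TIME OFF;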
What’s next?
In the next part of this series, I’ll look at how data access in a scalar UDF changes the situation – for the worse! After that, I’ll check the two types of table-valued functions. Where scalar UDFs are bad, table-valued UDFs can be downright ugly – or they can be good, depending on the type. I will also present a way to replace scalar functions with inline table-valued functions, so that you can encapsulate and reuse code without paying the gigantic performance penalty of scalar UDFs – but at the cost of complicating your query. So stay tuned!
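As a preview, here is a minimal sketch of that technique applied to the Triple example (the name dbo.Triple_TVF is my own, hypothetical choice; the details will follow in a later part):
CREATE FUNCTION dbo.Triple_TVF(@Input int)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN (SELECT 3 * @Input AS Triple);
go
SELECT MAX(t.Triple) AS MaxTriple
FROM dbo.LargeTable
CROSS APPLY dbo.Triple_TVF(DataVal) AS t;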
Reposted from: http://sqlblog.com/blogs/hugo_kornelis/archive/2012/05/20/t-sql-user-defined-functions-the-good-the-bad-and-the-ugly-part-1.aspx