T-SQL Tuesday # 16 : User defined Concat of SQLServer
This week, T-SQL Tuesday is being hosted by Jes Borland (blog | twitter), and the theme is "Aggregate Functions." When people think of aggregates, they tend to think of MAX(), SUM() and COUNT(). And occasionally, less common functions such as AVG() and STDEV(). I thought I would write a quick post about a different type of aggregate: string concatenation. Even going back to my classic ASP days, one of the more common questions out in the community has been, "how do I turn a column into a comma-separated list?" Over the years, I've used a variety of approaches to solve this problem. For this post I wanted to compare a few different methods, sticking to pure T-SQL and ignoring external solutions such as SQLCLR for brevity. I'm also going to use inline code instead of bothering to put anything into UDFs, so it's clear that function calling etc. is not interfering with the results. I'm sure there is nothing you haven't seen before, but having spotted several recent discussions where people profess things like, "a while loop is more efficient than a cursor," I thought it would be useful to take a quick glance at the performance of each of the common approaches. |
The Goal
Let's make the goal kind of silly: we want all of the names from msdb.sys.objects, but rather than multiple rows in a single column, we want the names in a single tuple, separated by columns. So instead of:
We want this:
A couple of things to note: we don't need to worry about NULL values in our concatenation, so there will be no special handling for that case, though it may be necessary in your scenario. I also don't need to worry about embedded commas in the data, because I know that none of my objects in MSDB are named using commas.
The Setup
I placed this code at the top of my script. I wanted a table variable to hold the object names, and some variables that would be re-used through the various string concatenation methods.
USE [msdb];
SET NOCOUNT ON;
DECLARE @foo TABLE
(
id INT IDENTITY(1,1)PRIMARY KEY,
name NVARCHAR(257)
);
INSERT @foo
SELECT DISTINCTn =OBJECT_SCHEMA_NAME([object_id]) + '.' + name
FROM sys.objects
ORDER BY n;
DECLARE
@s NVARCHAR(MAX),
@n NVARCHAR(257)
@loop INT =0,
@count INT =(SELECT COUNT(*) FROM @foo);
Approach #1a : An "Evil" Cursor
Very early on in my career, my approach would have been one of brute force: use a cursor. These days, when I see people compare a cursor to a while loop, I see them making a very unfair comparison; they use the default options for a cursor, which can be fairly heavyweight depending on concurrency and the rest of your workload. And I'll admit, in those early days, I often just blurted out "DECLARE c CURSOR FOR..." without any thoughts to options that I should be setting, such as READ_ONLY, FORWARD_ONLY, LOCAL, etc. I don't want this to blossom into a lesson on cursors and default settings vs. optimal settings (why GLOBAL is still the default, I'll never understand), but I will tell you that I now always use the following syntax:
DECLARE cursor_name CURSOR
LOCAL STATIC FORWARD_ONLY READ_ONLY
FOR ...
While it's true that FAST_FORWARD is essentially shorthand for FORWARD_ONLY and READ_ONLY, if you try to specify STATIC FAST_FORWARD, you get this error:
Msg 1048, Level 15, State 1, Line 2
Conflicting cursor options STATIC and FAST_FORWARD.
Hugo Kornelis does a great job of comparing the performance of all of the cursor options in this blog post. The two most interesting things I got from this article: (1) STATIC is faster than FAST_FORWARD, and (2) the default options will always result in the slowest possible performance (the latter I already knew, which is why it always irked me when people used the default options to show how slow cursors are). In spite of his assurances that there is no difference between LOCAL and GLOBAL or whether or not you use READ_ONLY, I am probably going to keep using them for documentation and future-proofing reasons.
That all said, here is how many people would use a default cursor to achieve the desired result:
DECLARE c CURSOR
FOR
SELECT name FROM @foo ORDER BY name;
OPEN c;
FETCH NEXTFROM c INTO @n;
WHILE @@FETCH_STATUS =0
BEGIN
SET @s =COALESCE(@s + ',' + @n, @n);
FETCH NEXTFROM c INTO @n;
END
CLOSE c;
DEALLOCATE c;
Approach #1b : An Optimized Cursor
The following cursor yields better performance than a default cursor, even if a few of the specified cursor options may truthfully have no impact.
DECLARE c CURSOR
LOCAL STATIC FORWARD_ONLY READ_ONLY
FOR
SELECT name FROM @foo ORDER BY name;
OPEN c;
FETCH NEXTFROM c INTO @n;
WHILE @@FETCH_STATUS =0
BEGIN
SET @s =COALESCE(@s + ',' + @n, @n);
FETCH NEXTFROM c INTO @n;
END
CLOSE c;
DEALLOCATE c;
I also see people suggesting that DEALLOCATE is sufficient, and that you don't need to CLOSE a cursor. For a long time, I would have agreed with them; however, it was pointed out to me last year (and I apologize, I tried to find the discussion, but failed), that you can actually observe some additional overhead if you fail to issue a CLOSE. I have plenty more to say about cursors, but I want to focus on the topic at hand, and will revisit these issues in a future blog post.
Approach #2 : A While Loop
A statement I hear a lot: "Don't use a cursor; use a while loop. They're faster." A loop is a loop; an iterative approach has the same performance limitations whether or not you explicitly use DECLARE CURSOR / FETCH. The while loop is a little tidier than the cursor, but relies on the IDENTITY column to iterate:
WHILE @loop < @count
BEGIN
SELECT TOP 1 @s =COALESCE(@s + ',' + name, name)
FROM @foo
WHERE id > @loop
ORDER BY id;
SET @loop +=1;
END
Approach #3 : An Undocumented / Unsupported Aggregation
This is probably the approach I've used the most in my career - it's essentially a cursor disguised as a set-based operation. I believe I first learned this approach from Anith Sen over a decade ago.
SELECT @s =COALESCE(@s + ',' + name, name)
FROM @foo
ORDER BY name;
Approach #4a : FOR XML PATH
When SQL Server 2005 was released, we were introduced to a much more powerful set of functionality around XML. One feature was the ability to convert a resultset into an XML string, and this was quickly exploited to join strings together without any XML tags at all, as follows:
;WITH x(x) AS
(
SELECT name AS [data()]
FROM @foo
ORDER BY name
FOR XML PATH('')
)
SELECT @s =REPLACE(x, ' ', ',')
FROM x;
Approach #4b : FOR XML PATH, TYPE
I picked up this approach from a comment by RBarryYoung on Adam Machanic's (blog | twitter) post from 2006 ("Rowset string concatenation: Which method is best?"). It is similar to the above but uses value() and TYPE for extracting the string values from the XML. I just couldn't figure out how to pull the values out with the comma separator without having to hack off the last comma.
SELECT @s =
(
SELECT name + ','
FROM @foo
ORDER BY name
FOR XML PATH(''), TYPE
).value('.[1]', 'NVARCHAR(MAX)');
SELECT @s =LEFT(@s, LEN(@s)-1);
Performance Test
I ran each of these code segments 1,000 times (script attached below). I also noted in Hugo's research that the situation can be even worse for the WHILE loop when the IDENTITY column is not the clustering key, so I tried the script again with the PRIMARY KEY on the name column of the table variable instead of the id column. Below are all of the results:
IDENTITY = PK name = PK Total Duration
(milliseconds)Average Duration
(milliseconds)Total Duration
(milliseconds)Average Duration
(milliseconds)1a : Default CURSOR 105,060 105.1 104,843 104.8 1b : Optimized CURSOR 84,710 84.7 85,013 85.0 2 : WHILE Loop 113,753 113.8 516,136 516.1 3 : Set-Based "Cursor" 47,640 47.6 46,750 46.8 4a : FOR XML PATH 7,113 7.1 6,753 6.8 4b : FOR XML PATH, TYPE 2,486 2.5 1,813 1.8
Conclusion
Clearly, at least in this specific scenario, the XML methods are far superior to any iterative approach. The disadvantage of the XML methods is that, like MERGE and other new syntax constructs, they are difficult to memorize and get right without referencing existing, working code. But if your goal is performance, the up-front and maintenance cost is going to be worth it in the long run. If you are working on a one-off, then the set-based concatenation (Approach #3) is likely a good trade-off. But in very single test, the WHILE loop is the worst performer. Perhaps there is some flawed logic in how I am constructing my loop; there are many ways to do it, this one just seemed the most intuitive and, more importantly, most efficient - at least among the typical WHILE loop methods.
I hope these results are something you keep in mind the next time you need to perform string concatenation in T-SQL, and you don't have the luxury of doing it elsewhere (e.g. in the application tier or in SQLCLR).
from http://sqlblog.com/blogs/aaron_bertrand/archive/2011/03/08/t-sql-tuesday-16-this-is-not-the-aggregate-you-re-looking-for.aspx
download the scripts TSQLTuesday_16.sql.zip