【转】Split strings the right way – or the next best way
2013-09-06 17:24 quanhb 阅读(308) 评论(0) 编辑 收藏 举报
I know many people are bored of the string splitting problem, but it still seems to come up almost daily on forum and Q & A sites like StackOverflow. This is the problem where people want to pass in a string like this:
EXEC dbo.UpdateProfile @UserID = 1, @FavoriteTeams = N'Patriots,Red Sox,Bruins';
Inside the procedure, they want to do something like this:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID FROM dbo.Teams WHERE TeamName IN (@FavoriteTeams);
This doesn't work because @FavoriteTeams is a single string, and the above translates to:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID FROM dbo.Teams WHERE TeamName IN (N'Patriots,Red Sox,Bruins');
SQL Server is therefore going to try to find a team named Patriots,Red Sox,Bruins, and I'm guessing there is no such team. What they really want here is the equivalent of:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID FROM dbo.Teams WHERE TeamName IN (N'Patriots', N'Red Sox', N'Bruins');
But since there is no array type in SQL Server, this is not how the variable is interpreted at all – it's still a simple, single string that happens to contain some commas. Questionable schema design aside, in this case the comma-separated list needs to be "split" into individual values – and this is the question that frequently spurs a lot of "new" debate and commentary about the best solution to achieve just that.
The answer seems to be, almost invariably, that you should use CLR. If you can't use CLR – and I know there are many of you out there who can't, due to corporate policy, the pointy-haired boss, or stubbornness – then you use one of the many workarounds that exist. And many workarounds exist.
But which one should you use?
I'm going to compare the performance of a few solutions – and focus on the question everyone always asks: "Which is fastest?" I'm not going to belabor the discussion around *all* of the potential methods, because several have already been eliminated due to the fact that they simply don't scale. And I may re-visit this in the future to examine the impact on other metrics, but for now I'm just going to focus on duration. Here are the contenders I am going to compare (using SQL Server 2012, 11.00.2316, on a Windows 7 VM with 4 CPUs and 8 GB of RAM):
CLR
If you wish to use CLR, you should definitely borrow code from fellow MVP Adam Machanic before thinking about writing your own (I've blogged before about re-inventing the wheel, and it also applies to free code snippets like this). He spent a lot of time fine-tuning this CLR function to efficiently parse a string. If you are currently using a CLR function and this is not it, I strongly recommend you deploy it and compare – I tested it against a much simpler, VB-based CLR routine that was functionally equivalent, but performed about three times worse.
So I took Adam's function, compiled the code to a DLL (using csc), and deployed just that file to the server. Then I added the following assembly and function to my database:
CREATE ASSEMBLY CLRUtilities FROM 'c:\DLLs\CLRUtilities.dll' WITH PERMISSION_SET = SAFE; GO CREATE FUNCTION dbo.SplitStrings_CLR ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE ( Item NVARCHAR(4000) ) EXTERNAL NAME CLRUtilities.UserDefinedFunctions.SplitString_Multi; GO
XML
This is the typical function I use for one-off scenarios where I know the input is "safe," but is not one I recommend for production environments (more on that below).
CREATE FUNCTION dbo.SplitStrings_XML ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN ( SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)') FROM ( SELECT x = CONVERT(XML, '<i>' + REPLACE(@List, @Delimiter, '</i><i>') + '</i>').query('.') ) AS a CROSS APPLY x.nodes('i') AS y(i) ); GO
A very strong caveat has to ride along with the XML approach: it can only be used if you can guarantee that your input string does not contain any illegal XML characters. One name with <, > or & and the function will blow up. So regardless of the performance, if you're going to use this approach, be aware of the limitations – it should not be considered a viable option for a generic string splitter. I'm including it in this round-up because you may have a case where you can trust the input – for example it is possible to use for comma-separated lists of integers or GUIDs.
Numbers table
This solution uses a Numbers table, which you must build and populate yourself. (We've been requesting a built-in version for ages.) The Numbers table should contain enough rows to exceed the length of the longest string you'll be splitting. In this case we'll use 1,000,000 rows:
SET NOCOUNT ON; DECLARE @UpperLimit INT = 1000000; WITH n AS ( SELECT x = ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2 CROSS JOIN sys.all_objects AS s3 ) SELECT Number = x INTO dbo.Numbers FROM n WHERE x BETWEEN 1 AND @UpperLimit; GO CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers(Number) WITH (DATA_COMPRESSION = PAGE); GO
(Using data compression will drastically reduce the number of pages required, but obviously you should only use this option if you are running Enterprise Edition. In this case the compressed data requires 1,360 pages, versus 2,102 pages without compression – about a 35% savings.)
CREATE FUNCTION dbo.SplitStrings_Numbers ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN ( SELECT Item = SUBSTRING(@List, Number, CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number) FROM dbo.Numbers WHERE Number <= CONVERT(INT, LEN(@List)) AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter ); GO
Common Table Expression
This solution uses a recursive CTE to extract each part of the string from the "remainder" of the previous part. As a recursive CTE with local variables, you'll note that this had to be a multi-statement table-valued function, unlike the others which are all inline.
CREATE FUNCTION dbo.SplitStrings_CTE ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS @Items TABLE (Item NVARCHAR(4000)) WITH SCHEMABINDING AS BEGIN DECLARE @ll INT = LEN(@List) + 1, @ld INT = LEN(@Delimiter); WITH a AS ( SELECT [start] = 1, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, @ld), 0), @ll), [value] = SUBSTRING(@List, 1, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, @ld), 0), @ll) - 1) UNION ALL SELECT [start] = CONVERT(INT, [end]) + @ld, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + @ld), 0), @ll), [value] = SUBSTRING(@List, [end] + @ld, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + @ld), 0), @ll)-[end]-@ld) FROM a WHERE [end] < @ll ) INSERT @Items SELECT [value] FROM a WHERE LEN([value]) > 0 OPTION (MAXRECURSION 0); RETURN; END GO
Jeff Moden's splitter
Over on SQLServerCentral, Jeff Moden presented a splitter function that rivaled the performance of CLR, so I thought it only fair to include it in this round-up. I had to make a few minor changes to his function in order to handle our longest string (500,000 characters), and also made the naming conventions similar:
CREATE FUNCTION dbo.SplitStrings_Moden ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), E2(N) AS (SELECT 1 FROM E1 a, E1 b), E4(N) AS (SELECT 1 FROM E2 a, E2 b), E42(N) AS (SELECT 1 FROM E4 a, E2 b), cteTally(N) AS (SELECT 0 UNION ALL SELECT TOP (DATALENGTH(ISNULL(@List,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E42), cteStart(N1) AS (SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)) SELECT Item = SUBSTRING(@List, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,8000)) FROM cteStart s;
As an aside, for those using Jeff Moden's solution, you may consider using a Numbers table as above, and experimenting with a slight variation on Jeff's function:
CREATE FUNCTION dbo.SplitStrings_Moden2 ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH cteTally(N) AS ( SELECT TOP (DATALENGTH(ISNULL(@List,1))+1) Number-1 FROM dbo.Numbers ORDER BY Number ), cteStart(N1) AS ( SELECT t.N+1 FROM cteTally t WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0) ) SELECT Item = SUBSTRING(@List, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter, @List, s.N1), 0) - s.N1, 8000)) FROM cteStart AS s;
(This will trade slightly higher reads for slightly lower CPU, so may be better depending on whether your system is already CPU- or I/O-bound.)
Sanity checking
Just to be sure we're on the right track, we can verify that all five functions return the expected results:
DECLARE @s NVARCHAR(MAX) = N'Patriots,Red Sox,Bruins'; SELECT Item FROM dbo.SplitStrings_CLR (@s, N','); SELECT Item FROM dbo.SplitStrings_XML (@s, N','); SELECT Item FROM dbo.SplitStrings_Numbers (@s, N','); SELECT Item FROM dbo.SplitStrings_CTE (@s, N','); SELECT Item FROM dbo.SplitStrings_Moden (@s, N',');
And in fact, these are the results we see in all five cases…
The Test Data
Now that we know the functions behave as expected, we can get to the fun part: testing performance against various numbers of strings that vary in length. But first we need a table. I created the following simple object:
CREATE TABLE dbo.strings ( string_type TINYINT, string_value NVARCHAR(MAX) ); CREATE CLUSTERED INDEX st ON dbo.strings(string_type);
I populated this table with a set of strings of varying lengths, making sure that roughly the same set of data would be used for each test – first 10,000 rows where the string is 50 characters long, then 1,000 rows where the string is 500 characters long, 100 rows where the string is 5,000 characters long, 10 rows where the string is 50,000 characters long, and so on up to 1 row of 500,000 characters. I did this both to compare the same amount of overall data being processed by the functions, as well as to try to keep my testing times somewhat predictable.
I use a #temp table so that I can simply use GO <constant> to execute each batch a specific number of times:
SET NOCOUNT ON; GO CREATE TABLE #x(s NVARCHAR(MAX)); INSERT #x SELECT N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,'; GO INSERT dbo.strings SELECT 1, s FROM #x; GO 10000 INSERT dbo.strings SELECT 2, REPLICATE(s,10) FROM #x; GO 1000 INSERT dbo.strings SELECT 3, REPLICATE(s,100) FROM #x; GO 100 INSERT dbo.strings SELECT 4, REPLICATE(s,1000) FROM #x; GO 10 INSERT dbo.strings SELECT 5, REPLICATE(s,10000) FROM #x; GO DROP TABLE #x; GO -- then to clean up the trailing comma, since some approaches treat a trailing empty string as a valid element: UPDATE dbo.strings SET string_value = SUBSTRING(string_value, 1, LEN(string_value)-1) + 'x';
Creating and populating this table took about 20 seconds on my machine, and the table represents about 6 MB worth of data (about 500,000 characters times 2 bytes, or 1 MB per string_type, plus row and index overhead). Not a huge table, but it should be large enough to highlight any differences in performance between the functions.
The Tests
With the functions in place, and the table properly stuffed with big strings to chew on, we can finally run some actual tests to see how the different functions perform against real data. In order to measure performance without factoring in network overhead, I used SQL Sentry Plan Explorer, running each set of tests 10 times, collecting the duration metrics, and averaging.
The first test simply pulled the items from each string as a set:
DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; DECLARE @string_type INT = <string_type>; -- 1-5 from above SELECT t.Item FROM dbo.strings AS s CROSS APPLY dbo.SplitStrings_<method>(s.string_value, ',') AS t WHERE s.string_type = @string_type;
The results show that as the strings get larger, the advantage of CLR really shines. At the lower end, the results were mixed, but again the XML method should have an asterisk next to it, since its use depends on relying on XML-safe input. For this specific use case, the Numbers table consistently performed the worst:
After the hyperbolic 40-second performance for the numbers table against 10 rows of 50,000 characters, I dropped it from the running for the last test. To better show the relative performance of the four best methods in this test, I've dropped the Numbers results from the graph altogether:
Next, let's compare when we perform a search against the comma-separated value (e.g. return the rows where one of the strings is 'foo'). Again we'll use the five functions above, but we'll also compare the result against a search performed at runtime using LIKE instead of bothering with splitting.
DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; DECLARE @i INT = <string_type>, @search NVARCHAR(32) = N'foo'; ;WITH s(st, sv) AS ( SELECT string_type, string_value FROM dbo.strings AS s WHERE string_type = @i ) SELECT s.string_type, s.string_value FROM s CROSS APPLY dbo.SplitStrings_<method>(s.sv, ',') AS t WHERE t.Item = @search; SELECT s.string_type FROM dbo.strings WHERE string_type = @i AND ',' + string_value + ',' LIKE '%,' + @search + ',%';
These results show that, for small strings, CLR was actually the slowest, and that the best solution is going to be performing a scan using LIKE, without bothering to split the data up at all. Again I dropped the Numbers table solution from the 5th approach, when it was clear that its duration would increase exponentially as the size of the string went up:
And to better demonstrate the patterns for the top 4 results, I've eliminated the Numbers and XML solutions from the graph:
Next, let's look at replicating the use case from the beginning of this post, where we're trying to find all the rows in one table that exist in the list being passed in. As with the data in the table we created above, we're going to create strings varying in length from 50 to 500,000 characters, store them in a variable, and then check a common catalog view for existing in the list.
DECLARE @i INT = <num>, -- value 1-5, yielding strings 50 - 500,000 characters @x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,'; SET @x = REPLICATE(@x, POWER(10, @i-1)); SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x'; SELECT c.[object_id] FROM sys.all_columns AS c WHERE EXISTS ( SELECT 1 FROM dbo.SplitStrings_<method>(@x, N',') AS x WHERE Item = c.name ) ORDER BY c.[object_id]; SELECT [object_id] FROM sys.all_columns WHERE N',' + @x + ',' LIKE N'%,' + name + ',%' ORDER BY [object_id];
These results show that, for this pattern, several methods see their duration increase exponentially as the size of the string goes up. At the lower end, XML keeps good pace with CLR, but this quickly deteriorates as well. CLR is consistently the clear winner here:
And again without the methods that explode upward in terms of duration:
Finally, let's compare the cost of retrieving the data from a single variable of varying length, ignoring the cost of reading data from a table. Again we'll generate strings of varying length, from 50 – 500,000 characters, and then just return the values as a set:
DECLARE @i INT = <num>, -- value 1-5, yielding strings 50 - 500,000 characters @x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,'; SET @x = REPLICATE(@x, POWER(10, @i-1)); SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x'; SELECT Item FROM dbo.SplitStrings_<method>(@x, N',');
These results also show that CLR is fairly flat-lined in terms of duration, all the way up to 110,000 items in the set, while the other methods keep decent pace until some time after 11,000 items:
Conclusion
In almost all cases, the CLR solution clearly out-performs the other approaches – in some cases it's a landslide victory, especially as string sizes increase; in a few others, it's a photo finish that could fall either way. In the first test we saw that XML and CTE out-performed CLR at the low end, so if this is a typical use case *and* you are sure that your strings are in the 1 – 10,000 character range, one of those approaches might be a better option. If your string sizes are less predictable than that, CLR is probably still your best bet overall – you lose a few milliseconds at the low end, but you gain a whole lot at the high end. Here are the choices I would make, depending on the task, with second place highlighted for cases where CLR is not an option. Note that XML is my preferred method only if I know the input is XML-safe; these may not necessarily be your best alternatives if you have less faith in your input.
The only real exception where CLR is not my choice across the board is the case where you're actually storing comma-separated lists in a table, and then finding rows where a defined entity is in that list. In that specific case, I would probably first recommend redesigning and properly normalizing the schema, so that those values are stored separately, rather than using it as an excuse to not use CLR for splitting.
If you can't use CLR for other reasons, there isn't a clear-cut "second place" revealed by these tests; my answers above were based on overall scale and not at any specific string size. Every solution here was runner up in at least one scenario – so while CLR is clearly the choice when you can use it, what you should use when you cannot is more of an "it depends" answer – you'll need to judge based on your use case(s) and the tests above (or by constructing your own tests) which alternative is better for you.
Addendum : An alternative to splitting in the first place
The above approaches require no changes to your existing application(s), assuming they are already assembling a comma-separated string and throwing it at the database to deal with. One option you should consider, if either CLR is not an option and/or you can modify the application(s), is using Table-Valued Parameters (TVPs). Here is a quick example of how to utilize a TVP in the above context. First, create a table type with a single string column:
CREATE TYPE dbo.Items AS TABLE ( Item NVARCHAR(4000) );
Then the stored procedure can take this TVP as input, and join on the content (or use it in other ways – this is just one example):
CREATE PROCEDURE dbo.UpdateProfile @UserID INT, @TeamNames dbo.Items READONLY AS BEGIN SET NOCOUNT ON; INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, t.TeamID FROM dbo.Teams AS t INNER JOIN @TeamNames AS tn ON t.Name = tn.Item; END GO
Now in your C# code, for example, instead of building a comma-separated string, populate a DataTable (or use whatever compatible collection might already hold your set of values):
DataTable tvp = new DataTable(); tvp.Columns.Add(new DataColumn("Item")); // in a loop from a collection, presumably: tvp.Rows.Add(someThing.someValue); using (connectionObject) { SqlCommand cmd = new SqlCommand("dbo.UpdateProfile", connectionObject); cmd.CommandType = CommandType.StoredProcedure; SqlParameter tvparam = cmd.Parameters.AddWithValue("@TeamNames", tvp); tvparam.SqlDbType = SqlDbType.Structured; // other parameters, e.g. userId cmd.ExecuteNonQuery(); }
You might consider this to be a prequel to a follow-up post.
Of course this doesn't play well with JSON and other APIs – quite often the reason a comma-separated string is being passed to SQL Server in the first place.
本文转自:http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings