【转】Split strings the right way – or the next best way
I know many people are bored of the string splitting problem, but it still seems to come up almost daily on forum and Q & A sites like StackOverflow. This is the problem where people want to pass in a string like this:
EXEC dbo.UpdateProfile @UserID = 1, @FavoriteTeams = N'Patriots,Red Sox,Bruins';
Inside the procedure, they want to do something like this:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID
FROM dbo.Teams WHERE TeamName IN (@FavoriteTeams);
This doesn't work because @FavoriteTeams is a single string, and the above translates to:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID
FROM dbo.Teams WHERE TeamName IN (N'Patriots,Red Sox,Bruins');
SQL Server is therefore going to try to find a team named Patriots,Red Sox,Bruins, and I'm guessing there is no such team. What they really want here is the equivalent of:
INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, TeamID
FROM dbo.Teams WHERE TeamName IN (N'Patriots', N'Red Sox', N'Bruins');
But since there is no array type in SQL Server, this is not how the variable is interpreted at all – it's still a simple, single string that happens to contain some commas. Questionable schema design aside, in this case the comma-separated list needs to be "split" into individual values – and this is the question that frequently spurs a lot of "new" debate and commentary about the best solution to achieve just that.
The answer seems to be, almost invariably, that you should use CLR. If you can't use CLR – and I know there are many of you out there who can't, due to corporate policy, the pointy-haired boss, or stubbornness – then you use one of the many workarounds that exist. And many workarounds exist.
But which one should you use?
I'm going to compare the performance of a few solutions – and focus on the question everyone always asks: "Which is fastest?" I'm not going to belabor the discussion around *all* of the potential methods, because several have already been eliminated due to the fact that they simply don't scale. And I may re-visit this in the future to examine the impact on other metrics, but for now I'm just going to focus on duration. Here are the contenders I am going to compare (using SQL Server 2012, 11.00.2316, on a Windows 7 VM with 4 CPUs and 8 GB of RAM):
CLR
If you wish to use CLR, you should definitely borrow code from fellow MVP Adam Machanic before thinking about writing your own (I've blogged before about re-inventing the wheel, and it also applies to free code snippets like this). He spent a lot of time fine-tuning this CLR function to efficiently parse a string. If you are currently using a CLR function and this is not it, I strongly recommend you deploy it and compare – I tested it against a much simpler, VB-based CLR routine that was functionally equivalent, but performed about three times worse.
So I took Adam's function, compiled the code to a DLL (using csc), and deployed just that file to the server. Then I added the following assembly and function to my database:
CREATE ASSEMBLY CLRUtilities FROM 'c:\DLLs\CLRUtilities.dll'
WITH PERMISSION_SET = SAFE;
GO CREATE FUNCTION dbo.SplitStrings_CLR
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE ( Item NVARCHAR(4000) )
EXTERNAL NAME CLRUtilities.UserDefinedFunctions.SplitString_Multi;
GO
XML
This is the typical function I use for one-off scenarios where I know the input is "safe," but is not one I recommend for production environments (more on that below).
CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
A very strong caveat has to ride along with the XML approach: it can only be used if you can guarantee that your input string does not contain any illegal XML characters. One name with <, > or & and the function will blow up. So regardless of the performance, if you're going to use this approach, be aware of the limitations – it should not be considered a viable option for a generic string splitter. I'm including it in this round-up because you may have a case where you can trust the input – for example it is possible to use for comma-separated lists of integers or GUIDs.
Numbers table
This solution uses a Numbers table, which you must build and populate yourself. (We've been requesting a built-in version for ages.) The Numbers table should contain enough rows to exceed the length of the longest string you'll be splitting. In this case we'll use 1,000,000 rows:
SET NOCOUNT ON; DECLARE @UpperLimit INT = 1000000; WITH n AS
(
SELECT
x = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
CROSS JOIN sys.all_objects AS s3
)
SELECT Number = x
INTO dbo.Numbers
FROM n
WHERE x BETWEEN 1 AND @UpperLimit; GO
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers(Number)
WITH (DATA_COMPRESSION = PAGE);
GO
(Using data compression will drastically reduce the number of pages required, but obviously you should only use this option if you are running Enterprise Edition. In this case the compressed data requires 1,360 pages, versus 2,102 pages without compression – about a 35% savings.)
CREATE FUNCTION dbo.SplitStrings_Numbers
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = SUBSTRING(@List, Number,
CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number)
FROM dbo.Numbers
WHERE Number <= CONVERT(INT, LEN(@List))
AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter
);
GO
Common Table Expression
This solution uses a recursive CTE to extract each part of the string from the "remainder" of the previous part. As a recursive CTE with local variables, you'll note that this had to be a multi-statement table-valued function, unlike the others which are all inline.
CREATE FUNCTION dbo.SplitStrings_CTE
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS @Items TABLE (Item NVARCHAR(4000))
WITH SCHEMABINDING
AS
BEGIN
DECLARE @ll INT = LEN(@List) + 1, @ld INT = LEN(@Delimiter); WITH a AS
(
SELECT
[start] = 1,
[end] = COALESCE(NULLIF(CHARINDEX(@Delimiter,
@List, @ld), 0), @ll),
[value] = SUBSTRING(@List, 1,
COALESCE(NULLIF(CHARINDEX(@Delimiter,
@List, @ld), 0), @ll) - 1)
UNION ALL
SELECT
[start] = CONVERT(INT, [end]) + @ld,
[end] = COALESCE(NULLIF(CHARINDEX(@Delimiter,
@List, [end] + @ld), 0), @ll),
[value] = SUBSTRING(@List, [end] + @ld,
COALESCE(NULLIF(CHARINDEX(@Delimiter,
@List, [end] + @ld), 0), @ll)-[end]-@ld)
FROM a
WHERE [end] < @ll
)
INSERT @Items SELECT [value]
FROM a
WHERE LEN([value]) > 0
OPTION (MAXRECURSION 0); RETURN;
END
GO
Jeff Moden's splitter
Over on SQLServerCentral, Jeff Moden presented a splitter function that rivaled the performance of CLR, so I thought it only fair to include it in this round-up. I had to make a few minor changes to his function in order to handle our longest string (500,000 characters), and also made the naming conventions similar:
CREATE FUNCTION dbo.SplitStrings_Moden
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
E42(N) AS (SELECT 1 FROM E4 a, E2 b),
cteTally(N) AS (SELECT 0 UNION ALL SELECT TOP (DATALENGTH(ISNULL(@List,1)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E42),
cteStart(N1) AS (SELECT t.N+1 FROM cteTally t
WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0))
SELECT Item = SUBSTRING(@List, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@List,s.N1),0)-s.N1,8000))
FROM cteStart s;
As an aside, for those using Jeff Moden's solution, you may consider using a Numbers table as above, and experimenting with a slight variation on Jeff's function:
CREATE FUNCTION dbo.SplitStrings_Moden2
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH cteTally(N) AS
(
SELECT TOP (DATALENGTH(ISNULL(@List,1))+1) Number-1
FROM dbo.Numbers ORDER BY Number
),
cteStart(N1) AS
(
SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@List,t.N,1) = @Delimiter OR t.N = 0)
)
SELECT Item = SUBSTRING(@List, s.N1,
ISNULL(NULLIF(CHARINDEX(@Delimiter, @List, s.N1), 0) - s.N1, 8000))
FROM cteStart AS s;
(This will trade slightly higher reads for slightly lower CPU, so may be better depending on whether your system is already CPU- or I/O-bound.)
Sanity checking
Just to be sure we're on the right track, we can verify that all five functions return the expected results:
DECLARE @s NVARCHAR(MAX) = N'Patriots,Red Sox,Bruins'; SELECT Item FROM dbo.SplitStrings_CLR (@s, N',');
SELECT Item FROM dbo.SplitStrings_XML (@s, N',');
SELECT Item FROM dbo.SplitStrings_Numbers (@s, N',');
SELECT Item FROM dbo.SplitStrings_CTE (@s, N',');
SELECT Item FROM dbo.SplitStrings_Moden (@s, N',');
And in fact, these are the results we see in all five cases…
The Test Data
Now that we know the functions behave as expected, we can get to the fun part: testing performance against various numbers of strings that vary in length. But first we need a table. I created the following simple object:
CREATE TABLE dbo.strings
(
string_type TINYINT,
string_value NVARCHAR(MAX)
); CREATE CLUSTERED INDEX st ON dbo.strings(string_type);
I populated this table with a set of strings of varying lengths, making sure that roughly the same set of data would be used for each test – first 10,000 rows where the string is 50 characters long, then 1,000 rows where the string is 500 characters long, 100 rows where the string is 5,000 characters long, 10 rows where the string is 50,000 characters long, and so on up to 1 row of 500,000 characters. I did this both to compare the same amount of overall data being processed by the functions, as well as to try to keep my testing times somewhat predictable.
I use a #temp table so that I can simply use GO <constant> to execute each batch a specific number of times:
SET NOCOUNT ON;
GO
CREATE TABLE #x(s NVARCHAR(MAX));
INSERT #x SELECT N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,';
GO
INSERT dbo.strings SELECT 1, s FROM #x;
GO 10000
INSERT dbo.strings SELECT 2, REPLICATE(s,10) FROM #x;
GO 1000
INSERT dbo.strings SELECT 3, REPLICATE(s,100) FROM #x;
GO 100
INSERT dbo.strings SELECT 4, REPLICATE(s,1000) FROM #x;
GO 10
INSERT dbo.strings SELECT 5, REPLICATE(s,10000) FROM #x;
GO
DROP TABLE #x;
GO -- then to clean up the trailing comma, since some approaches treat a trailing empty string as a valid element:
UPDATE dbo.strings SET string_value = SUBSTRING(string_value, 1, LEN(string_value)-1) + 'x';
Creating and populating this table took about 20 seconds on my machine, and the table represents about 6 MB worth of data (about 500,000 characters times 2 bytes, or 1 MB per string_type, plus row and index overhead). Not a huge table, but it should be large enough to highlight any differences in performance between the functions.
The Tests
With the functions in place, and the table properly stuffed with big strings to chew on, we can finally run some actual tests to see how the different functions perform against real data. In order to measure performance without factoring in network overhead, I used SQL Sentry Plan Explorer, running each set of tests 10 times, collecting the duration metrics, and averaging.
The first test simply pulled the items from each string as a set:
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE; DECLARE @string_type INT = <string_type>; -- 1-5 from above SELECT t.Item FROM dbo.strings AS s
CROSS APPLY dbo.SplitStrings_<method>(s.string_value, ',') AS t
WHERE s.string_type = @string_type;
The results show that as the strings get larger, the advantage of CLR really shines. At the lower end, the results were mixed, but again the XML method should have an asterisk next to it, since its use depends on relying on XML-safe input. For this specific use case, the Numbers table consistently performed the worst:
Duration, in milliseconds
After the hyperbolic 40-second performance for the numbers table against 10 rows of 50,000 characters, I dropped it from the running for the last test. To better show the relative performance of the four best methods in this test, I've dropped the Numbers results from the graph altogether:
Next, let's compare when we perform a search against the comma-separated value (e.g. return the rows where one of the strings is 'foo'). Again we'll use the five functions above, but we'll also compare the result against a search performed at runtime using LIKE instead of bothering with splitting.
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE; DECLARE @i INT = <string_type>, @search NVARCHAR(32) = N'foo'; ;WITH s(st, sv) AS
(
SELECT string_type, string_value
FROM dbo.strings AS s
WHERE string_type = @i
)
SELECT s.string_type, s.string_value FROM s
CROSS APPLY dbo.SplitStrings_<method>(s.sv, ',') AS t
WHERE t.Item = @search; SELECT s.string_type
FROM dbo.strings
WHERE string_type = @i
AND ',' + string_value + ',' LIKE '%,' + @search + ',%';
These results show that, for small strings, CLR was actually the slowest, and that the best solution is going to be performing a scan using LIKE, without bothering to split the data up at all. Again I dropped the Numbers table solution from the 5th approach, when it was clear that its duration would increase exponentially as the size of the string went up:
Duration, in milliseconds
And to better demonstrate the patterns for the top 4 results, I've eliminated the Numbers and XML solutions from the graph:
Next, let's look at replicating the use case from the beginning of this post, where we're trying to find all the rows in one table that exist in the list being passed in. As with the data in the table we created above, we're going to create strings varying in length from 50 to 500,000 characters, store them in a variable, and then check a common catalog view for existing in the list.
DECLARE
@i INT = <num>, -- value 1-5, yielding strings 50 - 500,000 characters
@x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,'; SET @x = REPLICATE(@x, POWER(10, @i-1)); SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x'; SELECT c.[object_id]
FROM sys.all_columns AS c
WHERE EXISTS
(
SELECT 1 FROM dbo.SplitStrings_<method>(@x, N',') AS x
WHERE Item = c.name
)
ORDER BY c.[object_id]; SELECT [object_id]
FROM sys.all_columns
WHERE N',' + @x + ',' LIKE N'%,' + name + ',%'
ORDER BY [object_id];
These results show that, for this pattern, several methods see their duration increase exponentially as the size of the string goes up. At the lower end, XML keeps good pace with CLR, but this quickly deteriorates as well. CLR is consistently the clear winner here:
Duration, in milliseconds
And again without the methods that explode upward in terms of duration:
Finally, let's compare the cost of retrieving the data from a single variable of varying length, ignoring the cost of reading data from a table. Again we'll generate strings of varying length, from 50 – 500,000 characters, and then just return the values as a set:
DECLARE
@i INT = <num>, -- value 1-5, yielding strings 50 - 500,000 characters
@x NVARCHAR(MAX) = N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,'; SET @x = REPLICATE(@x, POWER(10, @i-1)); SET @x = SUBSTRING(@x, 1, LEN(@x)-1) + 'x'; SELECT Item FROM dbo.SplitStrings_<method>(@x, N',');
These results also show that CLR is fairly flat-lined in terms of duration, all the way up to 110,000 items in the set, while the other methods keep decent pace until some time after 11,000 items:
Duration, in milliseconds
Conclusion
In almost all cases, the CLR solution clearly out-performs the other approaches – in some cases it's a landslide victory, especially as string sizes increase; in a few others, it's a photo finish that could fall either way. In the first test we saw that XML and CTE out-performed CLR at the low end, so if this is a typical use case *and* you are sure that your strings are in the 1 – 10,000 character range, one of those approaches might be a better option. If your string sizes are less predictable than that, CLR is probably still your best bet overall – you lose a few milliseconds at the low end, but you gain a whole lot at the high end. Here are the choices I would make, depending on the task, with second place highlighted for cases where CLR is not an option. Note that XML is my preferred method only if I know the input is XML-safe; these may not necessarily be your best alternatives if you have less faith in your input.
The only real exception where CLR is not my choice across the board is the case where you're actually storing comma-separated lists in a table, and then finding rows where a defined entity is in that list. In that specific case, I would probably first recommend redesigning and properly normalizing the schema, so that those values are stored separately, rather than using it as an excuse to not use CLR for splitting.
If you can't use CLR for other reasons, there isn't a clear-cut "second place" revealed by these tests; my answers above were based on overall scale and not at any specific string size. Every solution here was runner up in at least one scenario – so while CLR is clearly the choice when you can use it, what you should use when you cannot is more of an "it depends" answer – you'll need to judge based on your use case(s) and the tests above (or by constructing your own tests) which alternative is better for you.
Addendum : An alternative to splitting in the first place
The above approaches require no changes to your existing application(s), assuming they are already assembling a comma-separated string and throwing it at the database to deal with. One option you should consider, if either CLR is not an option and/or you can modify the application(s), is using Table-Valued Parameters (TVPs). Here is a quick example of how to utilize a TVP in the above context. First, create a table type with a single string column:
CREATE TYPE dbo.Items AS TABLE
(
Item NVARCHAR(4000)
);
Then the stored procedure can take this TVP as input, and join on the content (or use it in other ways – this is just one example):
CREATE PROCEDURE dbo.UpdateProfile
@UserID INT,
@TeamNames dbo.Items READONLY
AS
BEGIN
SET NOCOUNT ON; INSERT dbo.UserTeams(UserID, TeamID) SELECT @UserID, t.TeamID
FROM dbo.Teams AS t
INNER JOIN @TeamNames AS tn
ON t.Name = tn.Item;
END
GO
Now in your C# code, for example, instead of building a comma-separated string, populate a DataTable (or use whatever compatible collection might already hold your set of values):
DataTable tvp = new DataTable();
tvp.Columns.Add(new DataColumn("Item")); // in a loop from a collection, presumably:
tvp.Rows.Add(someThing.someValue); using (connectionObject)
{
SqlCommand cmd = new SqlCommand("dbo.UpdateProfile", connectionObject);
cmd.CommandType = CommandType.StoredProcedure;
SqlParameter tvparam = cmd.Parameters.AddWithValue("@TeamNames", tvp);
tvparam.SqlDbType = SqlDbType.Structured;
// other parameters, e.g. userId
cmd.ExecuteNonQuery();
}
You might consider this to be a prequel to a follow-up post.
Of course this doesn't play well with JSON and other APIs – quite often the reason a comma-separated string is being passed to SQL Server in the first place.
本文转自:http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings
【转】Split strings the right way – or the next best way的更多相关文章
- caffe: test code for PETA dataset
test code for PETA datasets .... #ifdef WITH_PYTHON_LAYER #include "boost/python.hpp" name ...
- Java 中的泛型详解-Java编程思想
Java中的泛型参考了C++的模板,Java的界限是Java泛型的局限. 2.简单泛型 促成泛型出现最引人注目的一个原因就是为了创造容器类. 首先看一个只能持有单个对象的类,这个类可以明确指定其持有的 ...
- [爬虫]采用Go语言爬取天猫商品页面
最近工作中有一个需求,需要爬取天猫商品的信息,整个需求的过程如下: 修改后端广告交易平台的代码,从阿里上传的素材中解析url,该url格式如下: https://handycam.alicdn.com ...
- [Python] 00 - Books
A.I. & Optimization Advanced Machine Learning, Data Mining, and Online Advertising Services Ref: ...
- Java 泛型(Generics) 综述
一. 引子 一般的类和方法.仅仅能使用详细类型:要么是基本类型.要么是自己定义类型.假设要编写能够应用于多种类型的代码,这样的刻板的限制对代码的束缚就会非常大. 多态算是一种泛化机制,但对代码的约束还 ...
- golang学习总结
目录 1. 初识go语言 1.1 Hello World 1.2 go 数据类型 布尔: 整型: 浮点型: 字符类型 字符串型: 复数类型: 1.3 变量常量 局部变量: 全局变量 常量 1.5 字符 ...
- linux centos7安装phpMyAdmin详解,以及解决各种bug问题
使用php和mysql开发网站的话,phpmyadmin和navicat是目前非常好的mysql管理工具,但是phpmyadmin最主要是免费开源,目前很多集成的开发环境都会自带phpmyadmin, ...
- 344. Reverse String【easy】
344. Reverse String[easy] Write a function that takes a string as input and returns the string rever ...
- 15 使用lambdas和闭包
1 使用lambdas和闭包 1.1 定义闭包 闭包是一个代码块,代替了方法或类. groovy中的闭包基于后边的构造方式:{list of parameters-> closur ...
随机推荐
- UVA 11488-Hyper Prefix Sets(Trie)
题意: 给一个01串的集合,一个集合的幸运值是串的个数*集合中串的最大公共前缀 ,求所有子集中最大幸运值 分析: val[N]表示经过每个节点串的个数求幸运值 求就是每个节点值*该节点的深度 搜一遍树 ...
- 发送一个简单的HTTP GET请求并且取回响应。
string uri="http//www.baidu.com"; WebClient wc = new WebClient(); Console.WriteLine(" ...
- Anti-Grain Geometry 概述
AGG是一个轻量.灵活.可靠的图形算法库,AGG各部分之间是松耦合的,也即是说各部分可以单独使用. The primary goal of Anti-Grain Geometry is to brea ...
- 线性存储结构-Stack
Stack继承于Vector,是一个模拟堆栈结构的集合类.当然也属于顺序存储结构.这里注意Android在com.android.layoutlib.bridge.impl包中也有一个Stack的实现 ...
- [转载]su认证失败
Ubuntu 安装后,root用户默认是被锁定了的,不允许登录,也不允许 "su" 到 root.有人说这是个不好的实践,特别是对于服务器来说.我觉得对于桌面用户来说,这样安全性更 ...
- HIbernate学习笔记(七) hibernate中的集合映射和继承映射
九. 集合映射 1. Set 2. List a) @OrderBy 注意:List与Set注解是一样的,就是把Set更改为List就可以了 private List< ...
- Mysql的函数使用方法
今天有点临时需求要计算一张表的结果,不想写代码,想到了mysql的自定义函数.碰到了很多问题,为了方便一下使用,在此记录一下. 需求:一张表中,有比分,需要查询出比赛id和比赛结果. 分析: ...
- mssql触发器demo
USE [pos]GO/****** Object: Trigger [dbo].[tr_insert] Script Date: 06/26/2014 09:27:19 ******/SET ANS ...
- RC522天线匹配参数【worldsing笔记】
图为Device读卡器的参数值 EMC电路对读写距离影响不大: L3 和L4 固定为2.2uH: C11和C12也是固定值,如果P ...
- hibernate分页实现
1.创建分页实体类 public class PageBean { private int page; // 页码 private int rows; // 每页显示行数 private int st ...