3 years later and STRING_SPLIT is still incomplete

It's finally (more or less) complete!

Well, it's now 2022, and we can finally say that STRING_SPLIT is "complete"; or at least better (not accepting multi-character delimiters is still a problem). In Azure SQL Database, and in SQL Server 2022, STRING_SPLIT now has an optional bit parameter, enable_ordinal, which if 1 is passed to will mean that an additional column called ordinal is returned, which (unsurprisingly) returns the ordinal position of the value in the delimited list. By default, the value is 0 for the parameter, so if you did have any production code with SELECT * (though I hope you don't) the change would not be a breaking change.

So you can now do something like the following, and your values will be returned in the correct order:

SELECT [value]
FROM STRING_SPLIT('a,c,b',',',1) SS
ORDER BY ordinal;

This will give you 3 rows, and the rows will be returned in the same order as the values in the delimited value.

You can, however, read by original rant from 2019 below:


Bit of a mini rant today I'm afraid. Many of us are aware that STRING_SPLIT was added in SQL Server 2016, and yet 3 years later it's still incomplete.

Why it's incomplete

There's two major problems, in my view, with STRING_SPLIT. The first is that it doesn't provide an ordinal position for the text extracted. For a string like 'B,D,C,A' this means you could just as easily get the rows A, B, C and D as you could B, D, C and A. When working with delimited data, knowing the ordinal position is often really important, and often a must.

I have seen people therefore use code like this, but would you really trust SQL Server to number the rows correctly based on chance?

SELECT sq.[value]
FROM (SELECT SS.[value],
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN     
      FROM STRING_SPLIT(@DelimitedString,',') SS) sq
ORDER BY sq.RN;

Nothing in that code compels ROW_NUMBER to number those rows in the order they were in @DelimitedString, it's purely chance if they do; just like when you run a query against a table without an ORDER BY and they rows happen to be in the same order each time. This is even noted it it's own documentation:

The order of the output may vary as the order is not guaranteed to match the order of the substrings in the input string.

The second problem is that the second parameter for STRING_SPLIT is an nchar(1)/nvarchar(1). A delimiter is not always a single character. Those of us that have been using SQL Server prior to 2016 SP1 will know that SQL Server did not support quoted identifiers in CSV files. This meant when choosing a column delimiter you have to be very careful to ensure you picked one that didn't appear in your data. This can be very difficult to do when your data can have a range of special characters so many instead chose to use 2 characters as a delimiter. A common one I found was the double pipe (||).

What's more odd is that STRING_AGG (introduced in SQL Server 2017) does support multiple characters for a delimiter. Thus, in my view, STRING_SPLIT not supporting a multiple character delimiter is a huge oversight. Take something that should be "quite" simple:

SELECT SS.[value]
FROM (SELECT STRING_AGG(V.C,'||') WITHIN GROUP (ORDER BY V.C) AS A
      FROM (VALUES('A'),('B'),('C'))V(C)) S
           CROSS APPLY STRING_SPLIT(S.A,'||') SS;

One would expect the above to work; you're using STRING_AGG to create a delimited list (delimited by ||) and then using STRING_SPLIT to split it all out again. Unfortunately, however, you'll be presented with the below error:

Msg 214, Level 16, State 11, Line 3
Procedure expects parameter 'separator' of type 'nchar(1)/nvarchar(1)'.

Yep, SQL Server failed to split a string it had aggregated itself. /facepalm

Alternatives

There are, however, plenty of alternatives out there. DelimitedSplit8k_LEAD is a great alternative to ensure you get the ordinal position when splitting. Natively it only supports a single delimiter character as well, but could be (quite) easily modified to use more. Another alternative is the "JSON Splitter", which is demonstrated in this answer on Stack Overflow. This would support both multiple character delimiters and provide an ordinal position.

Fix the data

The real problem, however, is that you have well be storing delimited data in your RDBMS. That there, is the real problem. A single value for a column and row should represent a single value (it is a single value after all). Storing delimited data comes with a number of flaws, and solves nothing. Good design, and many-to-one (or many-to-many) relationships fix the problem at hand. If you're at an early enough stage where you can fix the design, do so now. If you're already in production, then look at fixing it in the future; you'll be grateful, and so will your RDBMS, when you do.

Leave a Reply

Your email address will not be published. Required fields are marked *