Wednesday, February 3, 2021

SQL Server SUBSTRING

The Substring function, one of the most used string manipulation functions in all languages, is also available in SQL Server. Just like any function, the SQL Server Substring function takes some arguments in a specific order and returns a value. 

As you can easily guess from the name, the SQL Server Substring function returns a specific part of a given string, and that means it takes a string as an argument and returns part of that string. The question is which part of the original string will it return? That is specified by the other two arguments, the start position (start) of the sub-string to be returned and the number of characters to return (length).

So the function looks like this: SUBSTRING ( OriginalString, start, length )

For example, if we have a string like 'My preferred city in US is Boston' and we wish to extract the name of the city from that string we can write 

SUBSTRING (‘My preferred city in US is Boston’, 28, 6) 

and we would get back the substring ‘Boston’.

As you can imagine the name of the city in the above string can be anything and you do not necessarily know the length, so how can you make sure the substring function will extract the full name of the city from the original string and nothing else? 

Let us assume you know the city name starts at position 28 and it is the last word on the original string, so basically you just want to make sure you return everything from position 28 to the end. You also know that the city name cannot be longer than say 100 characters so you could do:

SUBSTRING ( OriginalString, 28, 100 )

What will this return when the OriginalString is 'My preferred city in US is New York'? Will this return an error since the 'length' we have specified goes outside the boundaries of the OriginalString? No, it will not return an error. If the “start + length” exceeds the length of the OriginalString than the SUBSTRING function will return everything from the start position to the end of the OriginalString. 

However, even though this would work, is not satisfying. It just does not feel right to pass an argument that we know it will be out of range. So, how do we fix that? How about this:

SUBSTRING ( OriginalString, 28, LEN ( OriginalString ) - 28 + 1 )

Let’s review this: as you can see we are using LEN ( OriginalString ) - 28 + 1 to specify the length argument of the Substring function. What does that expression do? It calculates the exact length of the city name (assuming the city name is everything from position 28 to the end of the OriginalString). LEN is a T-SQL function that will return the length of the argument we pass to it, in this case the length of the OriginalString. So if the OriginalString is 'My preferred city in US is New York' then LEN(OriginalString) will return the value 35. Knowing that the city name starts in position 28 we now know that the name of the city in this case would be exactly (35 - 28 + 1) = 8 characters long. So, the SUBSTRING expression above would be the equivalent of writing SUBSTRING ( OriginalString, 28, 8), but we have parametrized it so that it works for any city name regardless of its length. 

Now, let us make this slightly more interesting. Assume your original string contains the city name but it is somewhere in the middle of the OriginalString and you need to extract it. You review the data and notice that the city name is always preceded by the string 'city of ' and is always followed by ' has'. How can we pull out the city name? How about this:

SELECT SUBSTRING ( @OriginalString, CHARINDEX ( 'city of', @OriginalString ) + 8, CHARINDEX ( ' has', @OriginalString, CHARINDEX ('city of', @OriginalString ) ) - ( CHARINDEX ( 'city of', @OriginalString ) + 8 ) )

This may seem convoluted, but it is very simple:

  • The CHARINDEX ( 'city of', @OriginalString ) will return the starting position of the string 'city of'' and since we can count that ‘city of’ contains 7 characters and is followed by a space before the city name appears then we know that the starting position for the city name is CHARINDEX ( 'city of', @OriginalString ) + 8
  • The other expression is calculating the length of the city name. Now, we know that the city name is followed by ‘ has’ however, if we use CHARINDEX ( ' has', @OriginalString) that will return the starting position of the very first ‘ has’ in the original string (there may be more than one ‘ has’ in the original string)  and that is now what we want. We want the starting position of the ‘ has’ immediately after the city name, hence, we say find the starting position of ' has' but don’t look at the whole OriginalString, start looking after the 'city of'
Try this:

DECLARE @OriginalString varchar(1024)
SET @OriginalString = 'The crime rate in the city of New York has increased significantly'

SELECT SUBSTRING ( @OriginalString, CHARINDEX('city of', @OriginalString) + 8, CHARINDEX(' has', @OriginalString, CHARINDEX('city of', @OriginalString)) - (CHARINDEX('city of', @OriginalString) + 8))





0 comments:

Post a Comment