Fundamentals: varchar is NOT a one size fits all data type

A common, and fatal, mistake I often see is the use of varchar/nvarchar being used for data that should be stored in a data type more appropriate; for example a date and time value. Storing data in a (n)varchar when it should be stored in a far more appropriate data type can have disastrous consequences for both the performance of your queries and the behaviour, as well as making tasks that seem like they should be simple far more difficult.

The most common data type I see stored in the wrong data type is date (and time) data; normally because the business/user has demanded that the data be in a specific format. Date and time data types do not have a format when stored, they are binary values, and it is up to the application to determine the format that the value is displayed in not the RDBMS. If you are being told that the data needs to be stored in a specific format, then you need to bounce that make to whomever is making that requirement and get them to talk to whomever is in charge of the application/presentation layer; they will then be able to change the presented format there.

Validating the value

A very simple, but fatal, problem with using the wrong data type is a just the simple lack of validation on the value. If you are storing numerical data in a (n)varchar unless you have a very good CHECK CONSTRAINT there is nothing stopping your users from putting incorrect values into your column. For example, you might find some users use a comma (,) as a thousands, millions, etc separator while others to denote a decimal point. Then when you see a value like '1,234' that might be either "one thousand, two hundred and thirty four", or it might be "one point two three four". Unless you ask the user that input you won't know.

For dates, you could easily end up with a date that doesn't exist, such as '31/04/2021' or even '29/02/2017'. April only has 30 days, and though February can have 29 days 2017 wasn't a leap year so there weren't 29 days that year.

Though I mentioned you could add a CHECK CONSTRAINT to the column, there's no replacement for using the correct data type here. Yes, you could have CHECK (TRY_CONVERT(date,VarcharDate,102) IS NOT NULL OR VarcharDate IS NULL), which eliminates the problem in this section, there are still consequences of using the wrong data type.

Comparisons against a varchar

If you are storing in the wrong data type, such as for a date, people who are querying your data will be unlikely to know your database is fundamentally flawed. If I, for example, see a column called PurchaseDate it's a "safe" assumption that the column will be a date and time data type such as a date or datetime2. As such when you perform a query you should be passing an appropriate date and time data type parameter or an unambiguous date(time) string literal (in SQL Server that would be yyyyMMdd or yyyy-MM-ddThh:mm:ss.nnnnnnn). Both of these, however, will have very likely have unexpected behaviour or performance.

For a string literal, the very likely scenario is going to be that you don't get any results back. If you're storing date and time data in your table as a varchar in the format dd/MM/yyyy then passing a string literal with a value in the format yyyyMMdd is never going to return a result. For someone that doesn't know your data (and flawed design) they could easily infer that there is no applicable data for that date, or assume the data is missing. Neither are the correct assumptions, but the user can't be blamed for assuming that a "date" column is a date can they.

If you're using parameters with the correct data type, then this will mean the column it is being compared to is going to be implicitly converted. Firstly, for performance, this is going to bad; any indexes you have won't be able to be used to perform a seek against the table. This known as the query is non-SARGable, it contains no Searchable ARGuments. For a small table, you may well no notice the problem, but for much larger tables, or when dealing with multiple tables, this is going to get exponentially worse as the entire table needs to be scanned for the values you're matching on.

But, rather than having a slow query, you might not even get a result. Let's assume that you have stored dates in a varchar column, and the values are in the format dd/MM/yyyy. Such a value is language specific. This means that for one LOGIN, which has its language set to BRITISH (as we don't speak English here in Britain according to Microsoft), the behaviour will be different to one where its language is set to ENGLISH (because Americans are English, not American). This is because the value in the table won't be read in the format dd/MM/yyyy but MM/dd/yyyy. This is very likely going to very likely end up in a conversion error, but worse could return the wrong result. Of either, I would honestly want the former as getting the incorrect data, without knowing it, is by far worse (in my opinion) than getting no data and an error.

This behaviour can be quite easily evidenced with a quick example:

CREATE TABLE dbo.SampleTable (VarcharDate varchar(10));
GO

INSERT INTO dbo.SampleTable
VALUES ('01/01/2020'), --01 Janurary 2020
       ('02/04/2020'); --02 April 2020
GO

SET DATEFORMAT dmy;
GO
DECLARE @date date = '20200101';
SET DATEFORMAT dmy;
--Works correct
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;

SET DATEFORMAT mdy;
--Works correct
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;
GO
DECLARE @date date = '20200402';
SET DATEFORMAT dmy;
--Works correct
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;

SET DATEFORMAT mdy;
--Returns no results
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;
GO

--Add a row to cause an error:
INSERT INTO dbo.SampleTable
VALUES('31/05/2020'); --31 May 2020
GO

DECLARE @date date = '20200531';
SET DATEFORMAT dmy;
--Works correct
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;

SET DATEFORMAT mdy;
--Errors!
SELECT *
FROM dbo.SampleTable
WHERE VarcharDate = @Date;
GO
--Clean up
DROP TABLE dbo.SampleTable;

Applying a CONVERT with a style code will alleviate the above behaviour, but you will have a non-SARGable query and so performance will (significantly) degrade.

Ordering is different

Numerical values, Dates, string (and other data types) all order differently. This means that if you are storing a value in a varchar when it should be stored in a more representative data type, you'll likely find that the data doesn't behave as you expect when you use an ORDER BY or the rows you would like to be returned when you query the data aren't the ones you want when you use operators like <, > and BETWEEN.

Let's start with numerical values. We all know that the number 10 is greater than the number 9, but what about the strings '10' and '9'? Well for the string the opposite is true; '10' is lower than the string '9'. String data types are ordered alphabetically, (normally) starting at the left most character and then moving right. For the 2 prior strings the first 2 characters would be compared, '1' and '9' respectively, and as '1' has a lower value than '9', then so too must '10'. This also means that a value like '919199' would have a lower value than a number like '97' as the first characters match, but then the second character in the first value has a lower value, and so the entire value must have a "lower" value.

For values like dates, the same principle applies. Take the 2 "dates" (varchars) '09/07/2020' and '10/03/2017'. No matter what language you're in, the first value would be the "greater" value; it's in 2020 where as the latter is in 2017. Of course, however, as we're dealing with a varchar once again the opposite is true, and the latter date has the "greater" value. In effect this means that, according to the date, the "date" '10/03/2017' is after '09/07/2020', which is obviously not true.

When using the <, > and BETWEEN operators, then you can get some really odd results. Take the clause WHERE VarcharDate >= '20/03/2020' AND VarcharDate < '09/04/2020'. This clause will never return any data as it is impossible for a (varchar) value to be lower than the value '09/04/2020' and greater or equal to the value '20/03/2020'. That's because the second value in the clause has a lower value than first, so nothing fits between the 2 values. It would be akin to asking for a number larger than 10 and lower than 5; no such numbers exist as 10 is larger than 5.

This behaviour, unsurprisingly, effects your ORDER BY clause as well, and your data will be sorted as the data type it is, a varchar. Take the below statement, which gives some (now) unsurprising results:

SELECT V.I
FROM (VALUES('1'),
		    ('4'),
		    ('9'),
		    ('17'),
		    ('24'),
		    ('78'),
		    ('207'),
		    ('409'),
		    ('8999'))V(I)
ORDER BY V.I;

Which gives you the results below:

I
----
1
17
207
24
4
409
78
8999
9

Conclusion

The conclusion here is simple, always use the correct data type for the data you are storing. If it's date/time then use one of the 6 date and time data types, if it's numerical then you likely want decimal or int (yes, float exists, but that comes with caveats and often misused). If your users "demand" that the data is in a specific format, get them to speak to the application/presentation layer's developers or pass the information on yourself. What format the users see the data in should have little impact on the "format" it is stored in, as data types like date are stored as binary values with no formatting information; how that value is then displayed is completely up to the application. If you have decided to store the data as a varchar then know that you are in for a world of "hurt" at some point down the line; either in performance or behaviour, and that you should heavily consider fixing that design sooner, rather than later before it becomes a much larger problem.

Leave a Reply

Your email address will not be published. Required fields are marked *