SQLComparison of NULL values by use of set operators in SQL Server

Morten Larsen

fre. 14. juni 201911 minutter lesetid

In this article we will go through a common task of searching a table for a filtering value using various approaches, ending up with the most elegant solution using set operators. This way of thinking will then be explored in other common usages, such as joins and MERGE statements.

Searching through a table using a SARG

Being able to search a table for a filtering value using an index can significantly improve performance in SQL Server.

A search without an index results in a so-called "Table Scan" and that implies that the whole table (maybe millions of rows) is read and compared to the filtering value, and this we want to avoid for performance reasons. A better performing search using an index can result in a so-called "Seek" and that implies that only the relevant part of the index (maybe just 5 or 10 rows) is read and compared to the filtering value, and this is generally much faster because of the smaller amount of IO traffic (disk reads).

To make a Seek happen requires of course that the table has an index on the column we are searching, but it also has some additional requirements to the expression in the WHERE clause: You can not apply any manipulation to the filtered column, like a function on the column, and your operator must be one that represents a consecutive range within the order that is defined by the index key. If these requirements are fulfilled, then the expression in the WHERE clause is considered a Search ARGument (also known as a SARG), and there is the potential for the SQL Server query optimizer to perform a seek instead of a table scan.

Our example: The Customer table

This calls for an example. Assume we have a table called Customer with a clustered index on the column CustomerId:

CREATE TABLE dbo.Customer (
    CustomerId INT IDENTITY(1,1) NOT NULL,
    Firstname  NVARCHAR(30) NULL,
    Lastname   NVARCHAR(30) NULL,
    Street     NVARCHAR(60) NULL,
    Zip        NVARCHAR(10) NULL,
    City       NVARCHAR(30) NULL,
    Phone      NVARCHAR(24) NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED ( CustomerId )
);

And we have an additional index on the Phone column:

CREATE NONCLUSTERED INDEX Customer_Phone ON dbo.Customer ( Phone );

Let us say we want to write a stored procedure that takes the phone number as a parameter and returns that (or those) customers with the given phone number.

CREATE PROCEDURE FetchCustomer
    @Phone NVARCHAR(24)
AS
BEGIN
    SELECT CustomerId, Phone
    FROM dbo.Customer
    WHERE Phone = @Phone;
END;

We can then execute the stored procedure to find customers with a phone number of 55512345 like this:

EXEC dbo.FetchCustomer @Phone = '55512345';

If we in SQL Server Management Studio first press Ctrl-M and then F5, then we will see the actual query plan when we execute the stored procedure. The query plan shows a Seek that takes advantage of our index Customer_Phone. And of course, the query correctly returns the customer with the phone number specified.

For the rest of this article I want to focus on the query part, not the stored procedure part. So to make the example a bit simpler I remove the stored procedure and the execution of it, and I just have this simple script that does the same thing:

DECLARE @Phone NVARCHAR(24) = '55512345';
SELECT CustomerId, Phone
FROM dbo.Customer
WHERE Phone = @Phone;

Some customers do not have a phone number in the table and the value is then stored as NULL in the Phone column. To search for customers with no phone number one might think that this is the way to do it:

DECLARE @Phone NVARCHAR(24) = NULL;
SELECT CustomerId, Phone
FROM dbo.Customer
WHERE Phone = @Phone;

However, this will not work because in SQL one NULL is not considered equal to another NULL. The query returns no customers even though several customers are missing the phone number.

Using the IsNull function

Most people do the following to fix it: They use the IsNull function to replace the NULL with a value that can not normally appear in the data. For phone numbers the value 'XYZ' is not expected and can be used:

DECLARE @Phone NVARCHAR(24) = NULL;
SELECT CustomerId, Phone
FROM dbo.Customer
WHERE IsNull(Phone, 'XYZ') = IsNull(@Phone, 'XYZ');

Now we are getting the right result. The query will return the customers without phone number.

But the expression in the WHERE clause is not a SARG. The query plan shows an "Index Scan". So we are scanning all rows in the index. Note here that we are in fact using the index Customer_Phone, but not in the best way. What happens is that the SQL Server query optimizer sees that we are only interested in the columns that are covered by our index Customer_Phone so it does not have to use the table itself. Hereby it reduces the amount of data to read because the total size of the index is smaller than the table. But since we are applying a function (IsNull) to the Phone column then the expression is no longer a SARG and we can not benefit from a Seek to precisely pinpoint the customers without phone numbers, and we have to read all of the millions of rows in the index. That is what is happening when we see an "Index Scan" in the query plan. It is still better than a "Table Scan", but we would much prefer to see an "Index Seek" instead.

Using an OR clause

Here is the normally recommended way to make the expression in the WHERE clause into a SARG. We add an extra line with "OR" taking into account the special case when a NULL is supplied as the filter:

DECLARE @Phone NVARCHAR(24) = NULL;
SELECT CustomerId, Phone
FROM dbo.Customer
WHERE Phone = @Phone
OR (Phone IS NULL AND @Phone IS NULL);

Now the filter is a SARG, and we are getting the right result. The query plan shows an "Index Seek". With just 11000 rows in my test database the query runs 24 times faster than the previous query. With millions of rows in the table the performance gain would be even more dramatic. So we are happy.

Well, are we totally happy? This is fine with one filter/parameter, but what if you have more filters? Then the query will get more ugly and more messy, and consequently difficult to maintain.

Using set operators for comparison

There is a shorter and more elegant form that many people are not familiar with, but it does the job with less code. This form was shown to me by the great SQL Server expert Itzik Ben-Gan at the SQL Bits 2019 conference.

There is a group of SQL operators which instead of using equality based comparison use distinctness based comparison when making comparisons. This group is called set operators, and it includes the operators UNION, INTERSECT, EXCEPT. These SQL operators correspond to the well-known operators from the mathematical field of Set theory. In Set theory the operators go under the names of Union, Intersection, and Set Difference (the last one is also known as Relative Complement).

The solution is to use a distinctness based comparison by using a combination of a subquery and a set operator. Write WHERE EXISTS and then form two sets each made of one row with a column for each parameter. With just one parameter it looks like this:

DECLARE @Phone NVARCHAR(24) = NULL;
SELECT CustomerId, Phone
FROM dbo.Customer
WHERE EXISTS (SELECT Phone INTERSECT SELECT @Phone);

EXISTS returns True if the intersection contains any rows. The intersection contains rows if the queries to the left and right of the INTERSECT keyword have any matching rows in common. This takes care of NULL values just like we want, and the query plan shows an "Index Seek".

You can extend this method to any number of parameters you want, provided you have the right indexes created. Here is an example with a search on three parameters:

SELECT CustomerId, Phone
FROM dbo.Customer
WHERE EXISTS (SELECT Phone, Firstname, Lastname INTERSECT SELECT @Phone, @Firstname, @Lastname);

Joins with set operators

Distinctness based comparison can also be used in other contexts, such as for joins with columns that can be NULL. Let us assume that the customer's first names can also be NULL, and we want to find matching customers from two different customer tables. Then we can do it like this, using the set operator INTERSECT:

SELECT c.*
FROM       dbo.CoffeeShopCustomer AS c
INNER JOIN dbo.PizzaPlaceCustomer AS p
ON EXISTS (
    SELECT c.Phone, c.FirstName, c.LastName, c.City
    INTERSECT
    SELECT p.Phone, p.FirstName, p.LastName, p.City
);

Merge statements with set operators

Assume we have just loaded some customers into a staging table called CustomerStage and this is going to be our source for a merge into our target table Customer. We will match rows in the two tables by their column CustomerId. When there is a match (i.e. in both tables there is a row with the same CustomerId) then we perform an update of the target table so that all the other columns (besides CustomerId) will get a new value coming from the source table. When there is no match (i.e. there is a row in the source table with a CustomerId that could not be found in the target table) then we perform an insertion into the target table so that the row will be copied with all it's values from the source table.

This is normally written like this using a MERGE statement:

MERGE INTO dbo.Customer AS tgt
USING dbo.CustomerStage AS src
    ON src.CustomerId = tgt.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.Firstname = src.Firstname
             , tgt.Lastname = src.Lastname
             , tgt.Street = src.Street
             , tgt.Zip = src.Zip
             , tgt.City = src.City
             , tgt.Phone = src.Phone
WHEN NOT MATCHED THEN
    INSERT ( CustomerId, Firstname, Lastname, Street, Zip, City )
    VALUES ( src.CustomerId, src.Firstname, src.Lastname, src.Street, src.Zip, src.City );

Often though, we want to make sure that if the source row is in fact identical to the target row (i.e. the name is the same, the address is the same, the phone is the same, etc.) then we do not want to apply the update, for these two reasons:

We do not want to spend time updating the table and all indexes on the table.
We do not want to execute any triggers or other auditing on the table.

The normal solution is then to add a predicate to the above line WHEN MATCHED THEN so that this line will be replaced by something like:

WHEN MATCHED AND (src.Firstname <> tgt.Firstname OR src.Lastname <> tgt.Lastname) THEN

(Of course, all the other fields should also be mentioned here).
(In SQL, the symbol <> means "not equal to", similar to != in other languages).

But this gets even more tricky when some of the columns support NULL values. Suppose for example that the Firstname and Lastname columns support NULL. Then the line will look like this:

WHEN MATCHED AND (src.Firstname <> tgt.Firstname
                  OR src.Firstname IS NULL AND tgt.Firstname IS NOT NULL
                  OR src.Firstname IS NOT NULL AND tgt.Firstname IS NULL
                  OR src.Lastname <> tgt.Lastname
                  OR src.Lastname IS NULL AND tgt.Lastname IS NOT NULL
                  OR src.Lastname IS NOT NULL AND tgt.Lastname IS NULL) THEN

Imagine how this will look if the table has 20 more columns in addition to Firstname and Lastname!

Fortunately there is a much cleaner way of doing this using distinctness based comparison, taking advantage of the set operator EXCEPT. The line can simply be written like this:

WHEN MATCHED AND EXISTS (SELECT src.* EXCEPT SELECT tgt.*) THEN

This says "take all the rows from the source table except those that precisely match any rows in the target table, and if any rows are left then perform an update of those rows", and it also handles NULL values without problems.

If the structures (i.e. column order and data types) of the source and target tables do not match perfectly then this method can still be used by defining a view on top of the source table and then using that view instead of the source table in the MERGE statement.

Conclusion

Besides the occasional usage of the UNION operator, the set operators are generally not much used in SQL. In this article I have argued that there are other ways of benefitting from the set operators. I have done so by demonstrating how to perform distinctness based comparison, in the contexts of searching, joining and merging. Have a go at trying this out for yourself. And maybe you can come up with even more ways to make good use of the powerful set operators.