If you’re a database practitioner, you’ve probably heard people talking about the importance of thinking in terms of sets when addressing querying tasks as opposed to thinking in iterative or cursor terms. You’ve also probably heard people using the term set-based to describe solutions, and perhaps you’ve even used the term yourself. But what does thinking in sets really mean? And why is it so important? I’ll try to answer these questions.
Why Sets?
Before I discuss what thinking in sets means, I want to talk about the philosophical aspects of the importance of thinking in terms of sets, or more specifically in relational terms. IT is a field that advances rapidly. New technologies keep being introduced, while existing technologies keep changing or disappear because they’ve become obsolete. However, there’s at least one computing technology that remains stable: the relational model. This profound mathematical model for database management was created by Edgar F. Codd in 1969. It hasn’t remained completely unchanged since its creation. Codd revised it, and it evolved thanks to efforts by Chris Date, Hugh Darwen, and others. But compared with other computing technologies, it’s much more stable. Similarly, the languages based on the relational model, such as SQL and its dialects (e.g., T-SQL), keep evolving but not as rapidly as other computing technologies.
One reason for the longevity of the relational model and the related languages is that the model is based on two strong mathematical foundations: set theory and predicate logic. If you learn the basic principles behind set theory and predicate logic, you’ll better understand the code you’re writing and hence write much better code.
The relational model isn’t perfect; perhaps someday a different model will replace it. But as long as it’s the dominant model, you want to flow with it and not against it. This means that you should use set-based solutions and not iterative or cursor solutions.
SQL (and its dialects) isn’t perfect either—in fact, SQL deviates from the relational model in many ways. It, too, might be replaced by a different language in the future. But at the moment, SQL is the de facto industry language for relational database management systems (RDBMSs). So, as long as you understand relational principles, you can use SQL in a relational way.
What Is a Set?
To understand what a set is, it’s helpful to look at how Georg Cantor, the creator of the mathematical set theory, defined a set. According to Joseph W. Dauben’s book Georg Cantor (Princeton University Press, 1990), Cantor defined a set as follows: “By a ‘set’ we mean any collection M into a whole of definite, distinct objects m (which are called the ‘elements’ of M) of our perception or of our thought.”
Observe the italicized words whole and distinct. Also observe that nowhere in the definition of a set does Cantor refer to the order of the elements. Let’s analyze these three aspects of a set in more detail.
A Set As a Whole
When you interact with a set, you need to think of it as a whole—not as individual elements. For example, when writing SQL queries, you interact with the input table as a whole and not with its individual rows. That’s set-based. When using a cursor, you deal with one row at a time. That’s not set-based.
Some people think that if you don’t explicitly use a cursor, the solution must be set-based. Consider, for example, a solution that implements the following logic, expressed in pseudo code:
retrieve top (1) row order by keywhile current key is not null
begin
process row with current key
retrieve top (1) row
where key > current key order by key
end
Although this iterative process doesn’t use a T-SQL CURSOR object, it deals with one row at a time. It’s far more expensive than an actual cursor in terms of its I/O footprint. At least with the cursor, you perform one pass over the data, so its I/O imprint is much less.
I already discussed the philosophical reasons for using set-based solutions. There’s also a more practical reason—performance. SQL Server’s implementation of T-SQL iterative constructs such as the WHILE loop is very inefficient. If you run a WHILE loop for a million iterations in T-SQL and a WHILE loop for a million iterations in a procedural language such as C#, you’ll see a striking difference in performance. On my laptop, the T-SQL WHILE loop took more than 100 seconds to complete, whereas the C# WHILE loop took under 10 seconds. That’s an order of magnitude difference. So, iterations in T-SQL are slow.