SQLServer Ranking
SQL Server 2005 introduced several new functions that belong to a group termed "Ranking Functions". In Books Online (BOL), Ranking Functions are described as “Ranking functions return a ranking value for each row in a partition. Depending on the function that is used, some rows might receive the same value as other rows. Ranking functions are nondeterministic.” The four ranking functions are ROW_NUMBER(), NTILE(), RANK() and DENSE_RANK(). Note that you might find references to these as “Windowing Functions” in some places on the internet.
BOL describes the ROW_NUMBER() function as “Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.” Its syntax is described as:
ROW_NUMBER ( ) OVER ( [ <partition_by_clause> ] <order_by_clause> )
Now, it’s time to show examples of how this works. First, let’s set up a business case. Assume that you run a nationwide scholarship program. Candidates can apply each year for up to 10 schools. You want to produce a report for the top 5 schools that each applicant has applied to, where they meet the eligibility requirements to attend, for the current school year. Ignoring all of the related tables that would be involved, the table at the root of this report would be defined as:
declare @Candidates table (CandidateID int NOT NULL, [Year] smallint NOT NULL, ChoiceNbr tinyint NOT NULL, SchoolID smallint NOT NULL, MeetsEligibility bit NOT NULL);
Now, let’s add some data to this table:
insert into @Candidates (CandidateID, [Year], ChoiceNbr, SchoolID, MeetsEligibility)select 1,2010,1,1,1 UNION ALL select 1,2010,2,2,1 UNION ALLselect 1,2010,3,3,1 UNION ALL select 1,2010,4,4,1 UNION ALLselect 1,2010,5,5,0 UNION ALL select 1,2010,6,6,0 UNION ALLselect 1,2010,7,7,1 UNION ALL select 1,2010,8,8,1 UNION ALLselect 1,2010,9,9,0 UNION ALL select 1,2010,10,10,1 UNION ALLselect 2,2010,1,1,0 UNION ALL select 2,2010,2,2,1 UNION ALLselect 2,2010,3,3,1 UNION ALL select 2,2010,4,4,0 UNION ALLselect 2,2010,5,5,1 UNION ALL select 2,2010,6,6,1 UNION ALLselect 2,2010,7,7,1 UNION ALL select 2,2010,8,8,0 UNION ALLselect 2,2010,9,9,1 UNION ALL select 2,2010,10,10,1;
If you run the SQL statement select * from @Candidates, you will see that there are two candidates applying to 10 schools each for the year 2010. They each meet the eligibility requirements to 7 schools, though not for the same choice number. To get just the schools where they meet the eligibility requirements, we obviously need to retrieve only those records where MeetsEligibility=1. The SQL Statement:
select * from @Candidates where MeetsEligibility = convert(bit,1);
Returns:
CandidateID | Year | ChoiceNbr | SchoolID | MeetsEligibility |
1 | 2010 | 1 | 1 | 1 |
1 | 2010 | 2 | 2 | 1 |
1 | 2010 | 3 | 3 | 1 |
1 | 2010 | 4 | 4 | 1 |
1 | 2010 | 7 | 7 | 1 |
1 | 2010 | 8 | 8 | 1 |
1 | 2010 | 10 | 10 | 1 |
2 | 2010 | 2 | 2 | 1 |
2 | 2010 | 3 | 3 | 1 |
2 | 2010 | 5 | 5 | 1 |
2 | 2010 | 6 | 6 | 1 |
2 | 2010 | 7 | 7 | 1 |
2 | 2010 | 9 | 9 | 1 |
2 | 2010 | 10 | 10 | 1 |
This removes the schools where they don’t meet the eligibility requirements, but we are still retrieving more than the desired 5 records for each candidate. If we use a top 5, then we only get the first 5 rows, not the first 5 for each candidate. Now, we could go through a looping mechanism, but that would be inefficient. Taking a look at the ROW_NUMBER() function, we see that it applies a sequential number to each row. The ROW_NUMBER() function assigns the row number in the order specified by the ORDER BY clause. Since we want the report for the top 5 schools for each candidate, we’ll use the ChoiceNbr column. So, let’s try it out with this SQL statement:
select *, RowNbr = ROW_NUMBER() OVER (ORDER BY ChoiceNbr) from @Candidates where MeetsEligibility = convert(bit,1);
(Note that the ORDER BY clause on the ROW_NUMBER() function is NOT the same as the ORDER BY clause on the SELECT statement. On the ROW_NUMBER() function, it controls the order of the assignment of the row number. Without an ORDER BY clause on the SELECT statement, the overall results of the query are not guaranteed to be in any particular order.)
This query returns:
CandidateID | Year | ChoiceNbr | SchoolID | MeetsEligibility | RowNbr |
1 | 2010 | 1 | 1 | 1 | 1 |
1 | 2010 | 2 | 2 | 1 | 2 |
2 | 2010 | 2 | 2 | 1 | 3 |
2 | 2010 | 3 | 3 | 1 | 4 |
1 | 2010 | 3 | 3 | 1 | 5 |
1 | 2010 | 4 | 4 | 1 | 6 |
2 | 2010 | 5 | 5 | 1 | 7 |
2 | 2010 | 6 | 6 | 1 | 8 |
2 | 2010 | 7 | 7 | 1 | 9 |
1 | 2010 | 7 | 7 | 1 | 10 |
1 | 2010 | 8 | 8 | 1 | 11 |
2 | 2010 | 9 | 9 | 1 | 12 |
2 | 2010 | 10 | 10 | 1 | 13 |
1 | 2010 | 10 | 10 | 1 | 14 |
Here we see that the rows have been numbered in the RowNbr column, and that the results are ordered by the ChoiceNbr. However, the candidates are intermixed, and there still isn’t an easy way to get the top 5 schools for each candidate. Taking a look again at the ROW_NUMBER() function, we see that there is an optional PARTITION BY clause which will divide the result set into groups where the ROW_NUMBER() assignment is restarted at the start of each partition. Since we want to see the top 5 schools per CandidateID, let’s partition by the CandidateID with this SQL statement:
select *, RowNbr = ROW_NUMBER() OVER (PARTITION BY CandidateID ORDER BY ChoiceNbr) from @Candidates where MeetsEligibility = convert(bit,1);
This query returns:
CandidateID | Year | ChoiceNbr | SchoolID | MeetsEligibility | RowNbr |
1 | 2010 | 1 | 1 | 1 | 1 |
1 | 2010 | 2 | 2 | 1 | 2 |
1 | 2010 | 3 | 3 | 1 | 3 |
1 | 2010 | 4 | 4 | 1 | 4 |
1 | 2010 | 7 | 7 | 1 | 5 |
1 | 2010 | 8 | 8 | 1 | 6 |
1 | 2010 | 10 | 10 | 1 | 7 |
2 | 2010 | 2 | 2 | 1 | 1 |
2 | 2010 | 3 | 3 | 1 | 2 |
2 | 2010 | 5 | 5 | 1 | 3 |
2 | 2010 | 6 | 6 | 1 | 4 |
2 | 2010 | 7 | 7 | 1 | 5 |
2 | 2010 | 9 | 9 | 1 | 6 |
2 | 2010 | 10 | 10 | 1 | 7 |
We’re getting pretty close to what we want to see now: every time the CandidateID changes, the row numbering restarts. To get just the top 5 per candidate, just return the rows where RowNbr <= 5 with:
select CandidateID, [Year], ChoiceNbr, SchoolID, RowNbr from (select *, RowNbr = ROW_NUMBER() OVER (PARTITION BY CandidateID ORDER BY ChoiceNbr) from @Candidates where MeetsEligibility = convert(bit,1)) a where a.RowNbr <= 5 order by CandidateID, RowNbr;
Which returns:
CandidateID | Year | ChoiceNbr | SchoolID | RowNbr |
1 | 2010 | 1 | 1 | 1 |
1 | 2010 | 2 | 2 | 2 |
1 | 2010 | 3 | 3 | 3 |
1 | 2010 | 4 | 4 | 4 |
1 | 2010 | 7 | 7 | 5 |
2 | 2010 | 2 | 2 | 1 |
2 | 2010 | 3 | 3 | 2 |
2 | 2010 | 5 | 5 | 3 |
2 | 2010 | 6 | 6 | 4 |
2 | 2010 | 7 | 7 | 5 |
This is now returning what we want.
“But wait, there’s more!” What happens if we add data from another year to the table? To do this, I’ll just use the previous year, and use the same SchoolID + 100 as those already existing in the table:
insert into @Candidatesselect CandidateID, [Year]-1, ChoiceNbr, SchoolID + 100, MeetsEligibility from @Candidates;
When we run the above select SQL statement, we get:
CandidateID | Year | ChoiceNbr | SchoolID | RowNbr |
1 | 2010 | 1 | 1 | 1 |
1 | 2009 | 1 | 101 | 2 |
1 | 2009 | 2 | 102 | 3 |
1 | 2010 | 2 | 2 | 4 |
1 | 2010 | 3 | 3 | 5 |
2 | 2009 | 2 | 102 | 1 |
2 | 2010 | 2 | 2 | 2 |
2 | 2010 | 3 | 3 | 3 |
2 | 2009 | 3 | 103 | 4 |
2 | 2009 | 5 | 105 | 5 |
Here we see that we are getting data intermixed between the years. Luckily, both the ORDER BY and PARTITION BY clauses can accept multiple columns. Let’s add the Year to the PARTITION BY clause, and add to the WHERE clause the code necessary to get data for just the current year:
select CandidateID, [Year], ChoiceNbr, SchoolID, RowNbr from (select *, RowNbr = ROW_NUMBER() OVER (PARTITION BY CandidateID, [Year] ORDER BY ChoiceNbr) from @Candidates where MeetsEligibility = convert(bit,1) and [Year] = 2010) a where a.RowNbr <= 5 order by CandidateID, RowNbr;
This query returns:
CandidateID | Year | ChoiceNbr | SchoolID | RowNbr |
1 | 2010 | 1 | 1 | 1 |
1 | 2010 | 2 | 2 | 2 |
1 | 2010 | 3 | 3 | 3 |
1 | 2010 | 4 | 4 | 4 |
1 | 2010 | 7 | 7 | 5 |
2 | 2010 | 2 | 2 | 1 |
2 | 2010 | 3 | 3 | 2 |
2 | 2010 | 5 | 5 | 3 |
2 | 2010 | 6 | 6 | 4 |
2 | 2010 | 7 | 7 | 5 |
Now we are returning the desired result.
One of the other operations you can perform is an update statement. If you wanted to update the Year for the 4th of the candidate’s choices to 2020 for Candidate 1, you could run this SQL statement:
UPDATE dtMyUpdate SET [Year] = 2020 FROM (select *, RowNbr = ROW_NUMBER() OVER (PARTITION BY CandidateID, [Year] ORDER BY ChoiceNbr) from @Candidates where MeetsEligibility = convert(bit,1) and [Year] = 2010) dtMyUpdate WHERE CandidateID = 1 AND RowNbr = 4
Here SQL builds a derived table that gets the RowNbr using the ROW_NUMBER() function, and then updates it for the specified criteria.
Another nice thing that the ROW_NUMBER() function can be used for is in identifying contiguous data ranges. Instead of covering that in this article, I’m going to refer you to this blog by Dave Ballantyne in which this is covered: http://sqlblogcasts.com/blogs/sqlandthelike/archive/2009/08/27/sql-and-contiguous-data-ranges.aspx
Now, let’s take a look at the remaining ranking functions. BOL describes the RANK() function as “Returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.” DENSE_RANK() is described as “Returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.” NTILE() is described as “Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.” For RANK() and DENSE_RANK(), their syntax is identical to that of ROW_NUMBER(). The syntax for NTILE() is:
NTILE (integer_expression) OVER([<partition_by_clause>] < order_by_clause > )
The NTILE() function will group the results into the number of groups as specified by the integer_expression. If the groups cannot be evenly distributed, the lower numbered groups will contain one extra record.
Now, let’s look at the difference between RANK() and DENSE_RANK(). For both of these, if there is a tie between records, then all records with the tie will receive the same rank number. The difference is in what happens with the NEXT record. For DENSE_RANK(), the next rank number is used. For RANK(), the value assigned is as if all preceding records had no ties. If there is no partitioning involved, this would be the same as the ROW_NUMBER().
To show how all of these work, let’s look at a small-town graduating class of 12 students, and how all of the ranking functions will work with looking at their grade. Some of the students graduated with the same grade, so this will show the difference between RANK() an dDENSE_RANK(). We’ll also use NTILE() to break this class down into 5 groups.
declare @ClassRank table (StudentID tinyint, Grade tinyint)insert into @ClassRank select 1, 100 UNION ALL select 2, 100 UNION ALL select 3, 99 UNION ALL select 4, 98 UNION ALL select 5, 95 UNION ALL select 6, 95 UNION ALL select 7, 90 UNION ALL select 8, 89 UNION ALL select 9, 89 UNION ALL select 10, 85 UNION ALL select 11, 85 UNION ALL select 12, 82 select StudentID, Grade, RowNbr =ROW_NUMBER() OVER(ORDER BY Grade DESC), [Rank] =RANK() OVER(ORDER BY Grade DESC), DenseRank =DENSE_RANK() OVER(ORDER BY Grade DESC), [NTile] =NTILE(5) OVER (ORDER BY Grade DESC) from @ClassRank
This query will return:
StudentID | Grade | RowNbr | Rank | DenseRank | NTile |
1 | 100 | 1 | 1 | 1 | 1 |
2 | 100 | 2 | 1 | 1 | 1 |
3 | 99 | 3 | 3 | 2 | 1 |
4 | 98 | 4 | 4 | 3 | 2 |
5 | 95 | 5 | 5 | 4 | 2 |
6 | 95 | 6 | 5 | 4 | 2 |
7 | 90 | 7 | 7 | 5 | 3 |
8 | 89 | 8 | 8 | 6 | 3 |
9 | 89 | 9 | 8 | 6 | 4 |
10 | 85 | 10 | 10 | 7 | 4 |
11 | 85 | 11 | 10 | 7 | 5 |
12 | 82 | 12 | 12 | 8 | 5 |
Let’s examine the results for the first three students. The first two have the same grade, so they are both ranked #1 by both the RANK() and DENSE_RANK() functions. The third student has a different grade, and is assigned to a different rank. With the RANK() function, since this is the third student the assigned rank is #3. With the DENSE_RANK() function, the assigned rank is the next rank – in this case, #2.
In the NTILE() function, we specified to return 5 groups. Since there are 12 students, the results break down to all groups having 2 students, with the beginning groups being assigned an additional student until the total number of students are accounted for.
Another analogy to help you understand the RANK() function would be in sports. Multiple teams can be tied for the same position. If two teams are tied for the #2 spot, the 4th team is not ranked #3 – they would be ranked #4.
Hopefully, you will be able to use the ranking functions to eliminate some of the cursors (or other inefficient looping mechanisms) in your code, and to replace it with a more efficient set-based method.
References:
BOL Ranking Functions | http://msdn.microsoft.com/en-us/library/ms189798.aspx |
BOL ROW_NUMBER() | http://msdn.microsoft.com/en-us/library/ms186734.aspx |
BOL OVER Clause | http://msdn.microsoft.com/en-us/library/ms189461.aspx |
BOL ORDER BY Clause | http://msdn.microsoft.com/en-us/library/ms188385.aspx |
BOL RANK() | http://msdn.microsoft.com/en-us/library/ms176102.aspx |
BOL DENSE_RANK() | http://msdn.microsoft.com/en-us/library/ms173825.aspx |
BOL NTILE() | http://msdn.microsoft.com/en-us/library/ms175126.aspx |