Ads Top

SQL for Tableau: Window Functions

It’s been a while since I’ve shared a blog in my SQL for Tableau Users series, so I’m excited to be back with a blog on one of my absolute favorite features of SQL, window functions!! OK, they may not sound that exciting but they are really powerful and, once you get the syntax down, relatively easy to write.


Let’s start out by defining window functions. I personally like the definition provided by the PostgreSQL documentation:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row—the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

OK, that’s a bit confusing, but it’ll become much clearer as we look at some examples. So let’s start by looking at a simple example using Superstore on my publicly-available SQL Server (see SQL for Tableau Part 1: The Basics for details on connecting to this server).

We’ll begin with a simple aggregation of sales by region.

SELECT [Region], SUM([Sales])
FROM [Orders]
GROUP BY [Region]


We get one record for each region with each region’s total sales. But what if we want to see every row along with the sum of sales for the region?

SELECT [Row ID], [Region], [Sales],
SUM([Sales]) OVER (PARTITION BY [Region]) AS [Region Sales]
FROM [Orders]


The second line of the SQL leverages a window function to provide us with data outside of the row we’re looking at.

Let’s take a look at the the syntax:

SUM([Sales]) OVER (PARTITION BY [Region])

This statement starts with a function—in this case, an aggregation—then is followed by OVER. OVER essentially tells the database that we’re about to define the window. Then, within parentheses, we define the window using the PARTITION BY clause. This tells the database that we want to partition (i.e. group) our records together by the Region. So, in plain English, this would say “Sum the sales for the entire region that is specified on this row.”

Note: If you do not wish to partition your data at all (i.e. get the total sales for the entire data set), then you would simply not include the PARTITION BY clause:

SUM([Sales]) OVER () AS [Region Sales]

What’s nice about this is that it allows us to then perform some additional aggregations. For example, we could find each row’s percentage of the region’s sales.

SELECT [Row ID], [Region], [Sales]/SUM([Sales]) OVER (PARTITION BY [Region]) AS [% of Sales]
FROM [Orders]


To take this one step further, we can aggregate the main query as well. For example, let’s say we want to see the percentage of sales by customer, rather than each individual row. To do that, we’ll need to aggregate the sales by customer then divide by the total regional sales:

SELECT [Region], [Customer Name],
SUM([Sales]) AS [Customer Sales],
SUM(SUM([Sales])) OVER (PARTITION BY [Region]) AS [Region Sales]
FROM [Orders]
GROUP BY [Region], [Customer Name]


Notice that we need to further aggregate the window calculation. Without it, we’ll get an error. I won’t show you here, but you can see that we could now easily calculate the percent of regional sales for each customer.

Similarity to Tableau
If the above feels familiar to you, that’s because you’ve probably done something very similar in Tableau. Let’s take the above use case (each record’s percent of regional sales) and use Tableau to solve it. To do this, we’d use either LODs or window calculations. Let’s begin with window calcs. We’d start by creating our basic view:


Then we could write a calculation like this:

Region Sales
// Sales for the overall region.
WINDOW_SUM(SUM([Sales]))

Then we add this to the view, make sure it’s set to compute properly, and we’ll get:


Then, of course, we could perform the math, SUM([Sales])/[Region Sales] to get the percentage.

If we examine this closely, we can see how similar the SQL window functions are to this window SUM. Let’s look at them together:

SUM(SUM([Sales])) OVER (PARTITION BY [Region])

WINDOW_SUM(SUM([Sales]))

We don’t have to tell SQL that it’s a window calculation because of the OVER clause, whereas Tableau needs to know that we’re using a window sum. Additionally, in SQL, we specify how to partition our data within the script, while in Tableau, we specify this partitioning when we tell it how to compute the table calculation.

........................

A second option for doing this in Tableau would be an LOD calculation. Instead of the window sum, we’d do the following:

Region Sales
// Sales for the overall region.
{FIXED [Region]: SUM([Sales])}

This will produce the same result as our table calc. While this is syntactically less similar to the SQL than the window sum, we can definitely see some similarities. Fixing on Region in the LOD is similar to partitioning on Region in the SQL. Then, of course, we’re summing the sales within that partition.

Interestingly, the definition of window calculations shared earlier included the following: “Behind the scenes, the window function is able to access more than just the current row of the query result.” That sounds a lot like an LOD calculation and, as we’ve seen, they have a lot of similarities.

It’s important to point out that neither Tableau window calcs nor LODs are exactly synonymous with SQL window calcs. These Tableau calculations can change based on the dimensions on your view, the way you set them up to be computed, filters, and a number of other things. However, there are a lot of similarities, so considering your likely familiarity with these concepts already, I think they are valuable to help illustrate how SQL window calcs work.

Order By
We’ve seen how we can use PARTITION BY to partition our window functions, but there’s also another part of the syntax that can be very valuable, ORDER BY. To demonstrate this, let’s take a look at a scenario where we’d like to number each record. To do this, we can use the window function, ROW_NUMBER:

SELECT [Row ID], [Region], [Customer Name],
ROW_NUMBER() OVER (ORDER BY [Row ID]) AS [Row]
FROM [Orders]


In this case, we’re telling the database to give us a sequential row number ordered by Row ID. Since Row ID is a sequential number already, it’s not surprising that we get the same value. However, we can do a variety of other things, including changing the sort order. Just like normal ORDER BY statements, the data is sorted ascending by default, but if we add DESC after the sort, it’ll sort descending.

SELECT [Row ID], [Region], [Customer Name],
ROW_NUMBER() OVER (ORDER BY [Row ID] DESC) AS [Row]
FROM [Orders]


We can also sort on multiple fields. For example, let’s say we wish to number our rows first by Region, then by Sales from highest to lowest.

SELECT [Row ID], [Region], [Customer Name], [Sales],
ROW_NUMBER() OVER (ORDER BY [Region] ASC, [Sales] DESC) AS [Row]
FROM [Orders]


We’ve now essentially ranked our data for each region. Or did we? If we scroll down to the point where it changes from the Central region to the East, we can see what might be a problem (depending on our goal).


In this case, it just continues to number the records once it gets to a new region. But what if we want it to start over at each region (just like we might do with the “Restarting every” option of a Tableau table calculation)? This is where we can reintroduce PARTITION BY.

SELECT [Row ID], [Region], [Customer Name],
ROW_NUMBER() OVER (PARTITION BY [Region] ORDER BY [Region] ASC, [Sales] DESC) AS [Row]
FROM [Orders]


Because we’ve told the database to partition by the Region, it will restart the numbering when a new Region is encountered. While rudimentary, this works as a basic ranking mechanism, showing us the rank of each record within its given region.

While I’m here, I think it’s important to point out the similarity in this syntax to Tableau Prep’s new analytic functions. One of the fantastic functions they’ve made available is none other than ROW_NUMBER. In Prep, you’d perform this same basic row numbering technique using the following syntax:

{PARTITION [Region]: {ORDERBY [Region] ASC, [Sales] DESC : ROW_NUMBER()}}

This syntax is very similar to what we use in SQL, down to the key words used (PARTITION, ORDERBY)—even more so than the window calcs and LODs we discussed earlier.

Additional Functions
There are a handful of additional window functions that we have not yet discussed, two are which are my absolute favorite (I’ll save them for last). So, here are a few of them, with examples:

RANK/DENSE_RANK
Not surprisingly, this will rank your data. So, instead of the row numbering method used above, we could write this:

SELECT [Row ID], [Region], [Customer Name],
RANK() OVER (PARTITION BY [Region] ORDER BY [Region] ASC, [Sales] DESC) AS [Row]
FROM [Orders]


Unlike the row number function, rank handles ties by giving both rows the same rank (just like the RANK function in Tableau). There is also a variant of this, DENSE_RANK, which works just like Tableau’s RANK_DENSE function.

FIRST_VALUE/LAST_VALUE
These will return the first or last value within a partition. For example, let’s say that you want to compare a customer’s sale to that customer’s first ever sale. Without window functions, this is pretty difficult and requires various sub-queries, but FIRST_VALUE makes it much easier.

SELECT [Row ID], [Customer Name], [Order Date], [Sales],
FIRST_VALUE([Sales]) OVER (PARTITION BY [Customer Name] ORDER BY [Order Date]) AS [First Sale Amount]
FROM [Orders]


Likewise, we can use LAST_VALUE to obtain the last value in the partition.

LEAD/LAG
OK, now for my favorite window functions, LEAD and LAG. LEAD allows you to get a value from a later row in the data set, while LAG allows you to get a value from a previous row. For example, the following SQL will get a customer’s prior and next sales amount:

SELECT [Customer Name], [Row ID], [Order Date], [Sales] ,
LAG([Sales])  OVER (PARTITION BY [Customer Name] ORDER BY [Customer Name], [Order Date]) AS [Prior Sales],
LEAD([Sales]) OVER (PARTITION BY [Customer Name] ORDER BY [Customer Name], [Order Date]) AS [Next Sales]
FROM [Orders]


Though not shown in the above example, the LEAD and LAG functions have two additional parameters—offset and default value. Offset allows you to specify the number of rows back or forward you move, while the default value allows you to specify a value to return when we are outside of a partition. For instance, the NULL values in the above example—using Customer Name as a partition—result from the fact that the customer does not have a previous or next record.

We could, for example, write the following which will move backward and forward by 2 rows and return 0 as a default.

SELECT [Customer Name], [Row ID], [Order Date], [Sales] ,
LAG([Sales], 2, 0)  OVER (PARTITION BY [Customer Name] ORDER BY [Customer Name], [Order Date]) AS [Sales 2 Back],
LEAD([Sales], 2, 0) OVER (PARTITION BY [Customer Name] ORDER BY [Customer Name], [Order Date]) AS [Sales 2 Fwd]
FROM [Orders]


The reason I love these functions so much is that they have a ton of different use cases. Often, we find ourselves leveraging the LOOKUP function in Tableau to perform this operation. While that can often give us the correct results, having these values on a single record provides us so much more flexibility since it allows us to perform row-level calculations, avoiding the more complex table calculations.

For a great example of the power of LEAD and LAG, check out the blog I co-authored with fellow Tableau Zen Master, Klaus Schulte: Data Source-Based Lookups in Tableau. The blog includes multiple methods to solve a common problem, one of which uses the LAG window function.

Conclusion
So that’ll do it for this introduction to SQL window functions. We’ve only just scratched the surface as these functions can be nested, joined to other tables, and, in general, used in all kinds of sophisticated ways. But I’m hopeful that this introduction is enough to get you started writing your own window functions. As you can imagine, these are great for performing some up-front data prep, which could help out a lot when you get to Tableau—both by simplifying your Tableau calculations and potentially helping performance.

Before I go, I do need to note that, while window functions are now part of ANSI standard SQL, there are some platforms that do not support them (Microsoft Access, for example), but this list is short and getting shorter. I should also point out that there could be some slight syntactical differences between platforms, so if you use my SQL above and get errors, be sure to check the documentation on your specific database platform.

Thanks for reading!!

Ken Flerlage, October 12, 2020


5 comments:

  1. Ken - thank you so much for taking the time to put this series together.

    I've found it incredibly helpful and especially appreciate the connections you make between these methods in SQL and their similar implementations in Tableau. That has really helped clarify some things that I only understood vaguely before now.

    ReplyDelete
  2. This is so good, well explained and documented nicely. Very useful information and good sql examples.

    ReplyDelete
  3. Really good post, thank you for sharing!

    ReplyDelete

Powered by Blogger.