Equal-Width Sankey: A New Approach to Drawing Sankey Curves (with a Tableau Template)


Note: This blog is going to get into the weeds of sankeys and the math behind them. I highly recommend reading this as it will help you to understand and avoid some of the pitfalls of sankeys. It will also help you to understand how these charts work under the covers. But, if you just want to use the template, that's okay too. Feel free to skip to The Template section at the bottom where I show you how to use the template with your own data. Thanks!


Last year, Jeff Shaffer, the first person to build a sankey in Tableau, posted a blog titled, Sankey Diagrams: Why I Used the Sigmoid Functionand Why You Probably Shouldn’t. In the blog, he discussed the fact that the sigmoid curves used to draw most sankeys in Tableau have a fundamental problem as the curve tends to narrow in the middle preventing it from maintaining an equal width from beginning to end. He suggested that a better approach would be to use sine curves instead of sigmoid curves as the narrowing impact is much less severe. Jeff’s blog provided the following visual to show the issue and compare different curve types (based on the work of Chris DeMartini).


But, as Jeff pointed out, this is still not perfect. There is definitely still some narrowing in the middle of the sine curve.

The Problem
There are two things that cause this, the first being the approach used to draw the curve. To demonstrate, let’s start out with two equal-sized bars on either side of our sankey.

  
Now we’ll draw one curve connecting the tops of each bar. Note: I’m using a sigmoid for this example.


Next, we’ll connect the bottoms of each bar using the same kind of curve.


We can now clearly see the narrowing in the middle. The approach of drawing two separate curves to connect the top and bottom simply does not guarantee equal spacing between the two curves.

The second problem we encounter has to do with the sizing of the sheet. Take, for example, the following


We can see the first problem in this example. But what if we resize the sheet after placing it on a dashboard? In the image below, I’ve kept the same height but have made the sheet more narrow. This exacerbates the problem—the start and end of the curve are the same width as the previous image, but the narrowing effect on the middle of the curve is more severe.


So, the size of the sheet on a dashboard is also critical when considering a method to draw equal width curves.

Note: While both of these problems are more pronounced with sigmoid curves, as Jeff has pointed out, other curve types have the same issue.

A Potential Solution
A solution to this problem will have to address both of the items above—it must ensure that the distance between the curves remains consistent and it must address the sizing problem. Fortunately, at the end of Jeff’s blog, he included a link to a post by Sam Calisch on Github, from 2011, which describes a mathematical approach to drawing sankey curves which maintain a consistent width. I had come upon this research before and, when I saw it referenced again in Jeff’s blog, it got me thinking. I wondered if I could implement this technique in Tableau in order to address this problem once and for all.

So, let’s talk about Sam’s method a bit. His post includes the following image which does a pretty good job of summing up his approach.


There’s a lot going on here, so let me try to break this down into a pieces. There are essentially three components of Sam’s curves: a set of concentric circles on both the left and right and a rectangle in the middle. The distance between the inner and outer circle is the same as the width of the rectangle. These shapes are then connected as shown below.


The excess parts of the circle are then removed, leaving partial concentric circles on each end.


This is a very clever solution because it is easy to ensure that the width remains consistent when using rectangles and concentric circles.

To create these, we need to know a few key things:

t: The width of the flow
θ: The angle from the top point of each circle to the end point where it meets the rectangle.
r: The radius of the inner circles


With these known, we can use trigonometry and geometry to calculate and plot the semi-circles and the connecting rectangle. Fortunately, t is easy to come by because it’s the width of the flow, which will be determined by some measure in our data set. But calculating r and θ are not nearly as straightforward. Here is Sam’s explanation:

In the following, I’ll describe my method for computing the curves. It could be possible to use splines, but care must be taken to ensure the flow has constant thickness at all points. In order to do this, I use a region defined by concentric circular sections, followed by a sloped rectangle, followed by the same concentric circular section region (but rotated 180°)…

Using the conventions shown in these figures, the variables must satisfy


We set r = ¼(x2 - x1) (for no other reason than that it seems to work) and solve for θ. Without loss of generality, say (x1, y2) = (0, 0). Thus, θ is determined by (x2 - x1), (y2 - y1), and t. Note that if θ0 solves the system for a choice of (x2 - x1), (y2 - y1), and t, then it also solves the system for a(x2 - x1), a(y2 - y1), and at for a scale factor a. In loose language, the same θ works for a short, thin strand as well as a long, thick strand of the same slope. This means we can eliminate one variable. If we precompute a reasonable grid over the resulting two-dimensional space, we can avoid doing any equation solving...I’ll give you this look-up table.

I have to admit that this math kind of blew my mind. I could not understand how to solve this equation in any other way than by brute force experimentation. And, unfortunately, I could not find the lookup table he referenced.

Our Solution
That’s when I decided to call in some help. I sent a message to my brother, Kevin, asking if he had any idea how to solve this equation for r and θ. He was on vacation at the time so I figured he wouldn’t have any time to even think about it. But, after a few back and forth questions, he came back with an absolutely brilliant suggestion. Instead of using the method described by Sam Calisch, what if we did something completely different? He sent me the following that he drew on his phone while lounging on the beach:


His idea was to start by drawing one connecting curve on the bottom. Because curves drawn in Tableau are really just a series of straight lines connecting dots that are very close together—so close that you cannot detect it—we could use some algebra and geometry to first find a perpendicular line, then follow that perpendicular line up until we until we reach the target width of the curve. Then we plot a point there. We continue this process for each point on our curve and, once complete, we’ll have a series of points that are exactly the same distance from the first curve. Finally, we connect those points to draw our top curve.

This was absolute genius!!! (Thanks Kev!!) Had he not came up with this solution, I fear I’d still be beating my head off of the wall trying to solve Sam’s equation. But, seeing Kevin’s solution, I knew it would work. I then fleshed out his idea a bit further on a piece of paper:


Let me expound a bit on my explanation above. Here’s the basic process we’ll take:

1) Plot the points along the bottom curve.

2) Calculate the slope of each point. A point doesn’t really have a slope (only lines do), so we’ll actually calculate the slope of the two lines connected to the point, then average them together. This averaging might cause some slight variances from an actual slope, but there are so many individual line segments that these variances will not be of any relevance.

3) Based on the slope, we’ll find the slope of a perpendicular line.

4) Convert the slope of the perpendicular line to an angle.

5) Use the angle and the radius (the width) to find the opposite point (using trigonometry).

6) Repeat this process for each point, then connect those opposite points to create the top curve.

Building it in Tableau
Note: I’m a Tableau junkie and my ultimate goal was to solve this problem and templatize it so that people can create equal-width sankeys in Tableau (I’ll get to that shortly), but I want to note here that the solution documented above could be implemented in any tool or programming language used to create data visualizations. And we believe that it has some potential benefits over some of the existing methods for drawing equal-width curves, which we’ll get to shortly.

I started out by just trying to connect two sets of dots in Tableau. I won’t be going into the calculations in detail, but I essentially implemented the steps detailed in the previous section (if you’d like to see the calcs, feel free to download the workbook—I’ve added comments to make them as easy to understand as possible). After fighting with the calcs for a while, I finally produced this:


I have to admit that I was pretty excited about this and immediately shared it with Kevin—his solution had worked!! But, after looking at it for a bit, I noticed a problem. The curve on the top right takes a wider turn than the one on the bottom left. The problem was that I was drawing the bottom curve, then using the math to draw the top curve. Because there is less space on the left/inside of the bottom portion, it squishes the curve on that end.

To correct this, we need to draw a curve in the middle instead of the bottom, then use the perpendicular line approach to extend the curve to the left/top and right/bottom. By drawing the curve in the middle, we could guarantee a uniform curve throughout.

I reworked my calculations and created this:


The difference is subtle, but this method does produce a much more uniform curve than my previous attempt.

Integrating it into a Sankey
With the concept proven, I needed to integrate this method into a sankey. While I was able to leverage a lot of the setup from previous sankeys, the new curve method required a complete overhaul of the calculations. But, in the end, I was able to make it work. Here’s a simple animation showing the differences between the old method and the new method.


While the difference is somewhat subtle, it is very noticeable in some of the curves in the middle.

And what’s great about this method is that the math will work for any type of curve and will always guarantee the same width across the entire flow. To demonstrate, here’s the same sankey using a sine curve (instead of the sigmoid curve used above):


Resizing Problems
So that solves the first problem—the flaws in the curve drawing approach. But what about our second problem? As discussed earlier, if we resize a sheet on a dashboard, it will cause distortions. Unfortunately, this new method suffers from this problem as well. Here’s an example of the new method with a narrowed worksheet:


While we do see some distortion, it’s not particularly severe—it doesn’t impact this method quite as much as the previous method. But, since our goal is to ensure consistent width along the entire curve, we’ll want to address the problem. So what’s causing this flaw? When we change the dimensions of the sheet, the width of one unit (along the x axis) is no longer the same as the height of one unit (along the y axis). So, when the sheet is thinner than it is tall, it causes the curve to become thinner the closer it gets to vertical. The opposite is true when you make the sheet wider than tall—the curve gets thicker the closer you get to vertical. To correct this issue, we’ll introduce a parameter that allows us to artificially force the height and width of one unit to be equal. I’ll explain this a bit further in a moment.

To ensure that we always have an equal-width curve, we really need the sheet to always be a perfect square after it's been placed on the dashboard—so that the width and height of one unit is always equal. That, of course, is rarely an option for most data visualization developers, so we’ll need some technique for making adjustments. Unfortunately, this cannot be automated in Tableau as there is no way to get the dimensions of a sheet from a dashboard within calculated fields. However, there is a relatively simple solution to the problem. To address this flaw, we’ll need to set a different maximum x coordinate based on the amount of “squish” we apply to the sheet. By doing this, we’ll ensure that a unit always has the same width and height.

Let me give you an example to help explain. By default, each set of sankey curves are plotted on a 1 x 1 square grid.


The above sheet is 900 px wide and 900 px tall. But, if we adjust the width to 300 px, we can see the narrowing of the grid, which causes the narrowing of the curve.


Each unit of the grid now has a width that is 1/3 (0.33333…) of its height. But if we adjust this so that the curve is drawn from an x coordinate of 0 to 0.33333, each unit of our grid will return to equal width and height—we’re just not drawing as far to the right. If we then fix the x axis to 0.33333, we’ll have something like this.


Notice that each square in the grid how has the same width and height. And, as you can see, the curve is now a consistent width throughout. So, with this ability to make these fine-tuning adjustments, we are now able to solve our second flaw as well.

The Template
With both of these problems addressed, we are now able to create sankey curves that are guaranteed to have a consistent width!! So, my final step is to templatize this approach so you can easily plug in your own data and produce an equal-width sankey for yourself. I’ve previously created templates for regular sankeys, multi-level sankeys, traceable sankeys, gradient sankeys, and sankey funnels. I won’t be providing separate templates for all of these. Rather, I’m going to provide a single template for a multi-level sankey. This template will automatically work as a sankey funnel (if you have nulls in any of your steps) and can be easily modified to work as a single-level sankey. Gradient sankeys aren’t terribly practical, so I’m leaving those out. I’m also excluding the traceable sankeys for now, but, if there is enough demand, I may consider creating traceable versions in the future.

https://public.tableau.com/profile/ken.flerlage#!/vizhome/Equal-WidthSankeyTemplate/Sankey

Like my other templates, this one includes two components—an Excel spreadsheet and a Tableau workbook. The Excel spreadsheet has two sheets, Data and Model. Model is used to handle the data densification needed to draw the curves (Note: This model is different than previous templates). You don’t need to worry too much about this sheet—just make sure it’s in your spreadsheet. Data is used to populate your data. It contains columns for each of the steps, plus a Size field for the measure you’ll be visualizing.

Next download the Tableau template. Then edit the data source and connect it to your Excel file. The workbook should update automatically to reflect your data.

The workbook comes with three different curve types—Sigmoid, Sine, and Cubic (thanks to Chris DeMartini for his work on different curve types). By default, the curve is set to use Sine, but you can change it using the Curve Type parameter.

Like previous templates, the workbook also allows you to make the whitespace configurable. You can change this using the Whitespace parameter.

To adjust your curves to account for the size of the sheet on the dashboard, you’ll need to do the following:

1) Calculate the ratio of width to height. To do this, the sheet must first be placed on a dashboard and sized to meet your needs. Next, while still on the dashboard, click on the curve sheet, go to the Layout panel, find the width and height, then divide width by height.
2) Enter this value into the Squish Ratio parameter.
3) Edit the x axis on each curve sheet, setting the “fixed end” to use this value.

And that’s pretty much it. From here, you can do whatever you like with the chart—change the colors, add filters, update tooltips, etc. just as you normally would.

I’ve placed all the files in the following publicly accessible location. I’ve included the Excel template as well as workbooks in both 2019.4 and 10.4 formats.


Wrap-Up
This was a fun (and very challenging) project. If you’ve read through this whole post, thank you for indulging my extreme verbosity. I wanted to make sure I thoroughly explained all of the issues I was attempting to solve as well as show the various iterations I took to arrive at a solution. I hope you enjoyed this read and use this new sankey approach in your work. If you have any thoughts or comments, please leave them below. Or, if you have questions, need assistance, or experience any problems, feel free to reach out to me.

Update February 4, 2020: One question I get pretty regularly is how you can extend the template so that it has additional flows. I won't go into a lot of detail here, but the process basically entails the following:

1) Add a new "Step" to the spreadsheet--you can just call it Step 6 for consistency. 
2) In Tableau, copy the "Bar" calculations. For instance, copy the calcs in the "Bar 5" folder to create "Bar 6" calculations. Then edit each of those new calculations to refer to the "6" version. 
3) Similarly, create copies of the calcs in the "Curve 4-5" folder to create "Curve 5-6" calculations. 
4) Copy the Bar 5 sheet to one called Bar 6 and Curve 4-5 to Curve 5-6, then edit the fields used on those sheets to use the Bar 6 and Curve 5-6 calculated fields.  

For a more detailed explanation of how to add additional curves, see the following post on the Tableau Community Forums: Adding Steps to a Sankey


A note of warning: Often, when you copy the sheets then switch the calcs to use the new ones you've created, you may find that one of the table calcs turns red, indicating an error. If you hover over the calc, it will say that a field is missing. Despite all your efforts, you won't find a way to fix it. I've found that advanced table calculations, especially those with lots of nested calcs, sometimes get errors like this even though everything looks good under the covers. My best guess is that there is some missing pointer somewhere that causes it to get confused. The only way I've found to deal with this is to right-click on the red pill then set it to compute using some field (any field will do). You then have to go back and edit the table calculation, setting all the nested calcs to compute as desired. It can be a bit of a painful process, but it works.


Ken Flerlage (with help from Kevin Flerlage), January 6, 2020
 

37 comments:

  1. Thank you tremendously for your efforts with this!

    I am unable to download the template files right now to see the calculations used, but as I was reading through I immediately thought back to my Calculus courses and figured you were going to implement the derivative of the curve to get the slope of the tangent... then a perpendicular line to that tangent.

    But... it seems that knowing how Tableau draws curves as many joined lines made your life even easier!

    ReplyDelete
    Replies
    1. Yep, exactly right. I really didn't want to have to figure out that math again, so I was happy that I didn't have to!!

      Delete
  2. Hi Ken, Thanks for this amazing article. When you get a moment if you can fix the download link that would be great. As of now the workbooks in the link are pointing to "AnIntroductiontoDataDensification"

    ReplyDelete
    Replies
    1. You mean the "All the Files" link? It seems to work for me. Can you send me an email showing where it's taking you? flerlagekr@gmail.com

      Delete
  3. Hey Ken,

    First of all, thank you for all your blog posts. Been a huge help over time and I love reading your material. I've been using some of the sankey templates and trying to reciprocate. In almost every case, I encounter the issue you described where the table calc turns red, specifically for the curve calculations. However, upon toggling between compute using and then resetting to the specific dimensions, the calc turns green but the curves do not connect to the right bar positions (this happens in every template I have tried). For context, I am on 2020.2. Any idea why this might be happening? I plan to build from scratch and see if the same thing happens but I am a bit baffled why it doesn't work for plug and play.

    ReplyDelete
    Replies
    1. Hi Kyle, any chance you could send me an email? flerlagekr@gmail.com

      Delete
    2. Hey Ken!
      Thanks for these Sankey Templates and all your blog posts. Right now I'm facing the same issue as Kyle. I could even copy and paste his question, lol. Have you figured out a solution for it?

      Delete
    3. I believe we found a solution but I don't remember, to be completely honest. Can you email me? flerlagekr@gmail.com

      Delete
    4. Oh gosh same issue here. Anyone find a solution?

      Delete
    5. This isn't ringing a bell. Please email me and I'd be happy to help. I'm sure we can find a solution relatively quickly.

      Delete
    6. So with a friend (Paul O, thanks) we figured this out. On the Curve 3-4 Polygon, you have to edit every single nested calculation (8 of them), some giving preference to step 3, and others to step 4. I had only been editing one of the eight (N4 position). -Chris O

      Delete
    7. Ah yes, that is a common problem. Glad you figured it out!

      Delete
  4. Hi Ken,

    Thanks for your great work on this. Is there a way to incorporate this equal-width feature into the multi-level traceable sankey? (https://public.tableau.com/profile/ken.flerlage#!/vizhome/TraceableMulti-LevelSankeyTemplate/Sankey)? I'm brand new to tableau, so I'm not sure if there's a very simple way to do this.

    ReplyDelete
    Replies
    1. That's one feature I did not add to this yet. So, if you want to do the traceable thing, it's probably best to use the original version.

      Delete
  5. Any suggestions on creating a Sankey that has not got curves? i.e. parallelograms that join the bars together. I thought it would be as easy as updating the "curve x" and "curve y" formulas, but I'm now tying myself in knots trying to figure it out

    ReplyDelete
    Replies
    1. In that case, this equal-width sankey option isn't really necessary since straight lines won't suffer from the same problems as curves. If you check the blog by Jeff Shaffer that I referenced earlier (https://www.dataplusscience.com/Sigmoid.html), you'll see that his template has a number of different curve types, including "linear" which does exactly what you need.

      Delete
  6. Thank you Ken. I have successfully downloaded and able to use the Sankey diagram. However, there is one small item that doesn't seems to be working properly. If you look at my screenshot at https://ibb.co/Njk3qyj you can see that mouseover highlight doesn't work all the way. It works for the first two sections but doesn't work for 3rd and 4th sections. Please let me know if I am missing anything there. Thank you very much. Sohail Ansari

    ReplyDelete
    Replies
    1. I'll have to take a look at that and get back to you. Thanks for pointing this out.

      Delete
  7. Say you wanted to rearrange the order of the flows--how would one go about doing that?

    ReplyDelete
    Replies
    1. You'd have to change the sort order of the steps on both the bar and curve sheets. Does that make sense. Happy to provide further help if you could email me. flerlagekr@gmail.com

      Delete
  8. Hi Ken,

    Sorry for what I assume is a rather simple question, but I'm finding most labels larger than the 'A','B'...'Z' used in the template are getting cut-off. For example, one bucket uses "Personal Income Tax," but most of it is truncated. Is there anyway to get the labels to spill over to the side or be listed to the side of each bar?

    ReplyDelete
    Replies
    1. Labeling is a bit of a problem, unfortunately. You typically would need to make the bars larger or change the orientation (vertical, for example). I haven't found a good way to do it other than those options.

      Delete
  9. Thank you so much for your post! I was wondering if you allowed this tempalte to be used in commercial products and if so what information needed to be included with it. Thanks!

    ReplyDelete
    Replies
    1. Yes, you are free to use it however you like. I always appreciate a mention somewhere, as well as a link back to the blog (so others can see how to create it themselves), but that's not strictly required.

      Delete
  10. Thank you so much for your work!

    ReplyDelete
  11. Hi there I love your work! If you are dealing with summarized data that has different periods (in my case, 28 months of data), how would you modify the Size to Show calculation so that it displays the correct value? I am capturing customer behavior at 4 different points in time (each 7 months apart) and showing that with your template but my “size to show” values are all off and I’ve been trying to figure out the solution but I’m not having any luck. Any help you might have would be much appreciated! Thank you again for sharing all of your brilliant work!

    ReplyDelete
    Replies
    1. Probably best to reach out to me via email. flerlagekr@gmail.com

      Delete
  12. This is amazing! Thank you Ken!

    I was wondering, If I had the data like this:

    Step 1 S1 Total Step 2 S2 Total
    A 587 D 669
    B 633 E 823
    C 887 F 600
    G 15

    How do I get it partitioned like you have it on the Data tab?

    Thanks so much!

    Maribel

    ReplyDelete
  13. Thanks so much for this work, it is extremely helpful! Just want to add my +1 for a traceable template :)

    ReplyDelete
  14. Hi,

    I was wondering how I can skip a step in the flow. I'm using tracable multilevel sankey template and have a flow that diverts into two with different number of stages before merging into same level later on.
    I am beginner level for tableau and haven't gone into the process of understanding your sankey template, but I am good with excel, so can try for somewhat complicate solution if needed. (Although, i might wonder if this is worth doing so for my usecase.)

    Thanks for all your work. This is really awesome!!

    ReplyDelete
  15. Hi Ken -- I am unable to find the location of the "fixed end" for the x axis that you reference at the end of the article. What am I missing?

    ReplyDelete
    Replies
    1. See the "Edit an Axis Range" section on the following: https://help.tableau.com/current/pro/desktop/en-us/formatting_editaxes.htm

      Delete
  16. I am having a bit of trouble creating additional steps. I can recreate bar 5 as 6 with the new calculated fields. However, when I duplicate Curve 4-5 and replace the applicable new calculated fields and edit all the table calculations that drive C5-6 X Final and C6-7 X Final, those pills are red and I get this message when I hover: "The C5-6 X Final table calculation requires a field that is missing from the view. Add the STEP_1 field to the view to resolve"
    Note: When I was editing the table calculations after replacing C4-5 X Final with C5-6 X Final I had to uncheck STEP_1 and STEP_2 and check STEP_5 and STEP_6. When doing so STEP_1 and STEP_2 would disappear from the list as options, but it seems as if they are still "in the background" somewhere causing issues?
    When I add STEP_1 and STEP_2 as details the curves populate, but they are incorrect.
    Any suggestions?
    Unfortunately my data is sensitive and I cannot share the workbook.

    ReplyDelete
    Replies
    1. Can you try the steps at the end of the blog titled "A note of warning"? If that doesn't fix the problem, then send me an email. flerlagekr@gmail.com

      Delete
  17. Hey guys! I love this approach and have used it a number of times. I was able to update this approach to use a relationship instead of a join and it greatly reduced the number of rows returned (instead of multiplying by 100). You can also update the [Size to Show] to be SUM(Size) instead of having to account for the multiplication of rows.

    ReplyDelete
    Replies
    1. Ah yes, makes perfect sense. I created this shortly after relationships came out.

      Delete

Powered by Blogger.