Comments on The Flerlage Twins: Analytics, Data Visualization, and Tableau: An Introduction to Data Densification
Ken Flerlage (http://www.blogger.com/profile/03698843288892226027)
Last updated 2024-03-04 13:42 -05:00

Ken Flerlage, 2022-11-21 09:51 -05:00

Could you email me? flerlagekr@gmail.com

Anonymous, 2022-11-20 08:09 -05:00

I took my first shot at data densification this weekend. I picked a pretty complicated example. Basically, I have the dispersion (in x and y) of a number of golf shots. I wanted to have an ellipse that showed how many shots were contained within a certain confidence interval.

You can see my attempt at https://public.tableau.com/app/profile/steve7299/viz/Ellipse_16689447773940/Dashboard1?publish=yes

It took me a couple of days. And, it is VERY clunky. And, I'm not sure it's quite right either.

I wanted to plot the x and y coordinates of an ellipse with an 80% confidence interval. I used the standard equation for an ellipse, calculating the x and y coordinates using the angle, factored in the slope (I think the ellipse's longest axis should go along the line of the slope, but I guess it depends on the variance of the x and y data: the one with the larger variance should have the longer axis), and moved the center of the ellipse to Xbar and Ybar.

The confidence interval of the ellipse can be changed in the calculations for the a and b parameters of the ellipse. For both a and b, you just need to multiply by the constant for the chi-squared statistic for the probability you want (basically, how many points should be inside the ellipse versus outside of it).

I tried to densify the data by 360 (one for every degree). When I did this, it threw off the values of my original xy scatter plot. So, I divided those by 360. (Does data densification always do this, or did I do it wrong? This is my first attempt at data densification.) It's hard to tell, but my ellipse doesn't fully close, as the last point doesn't connect back to the first point.

I couldn't figure out how to get my scatter plot and ellipse on the same chart (perhaps I need to duplicate the data to do that). So, I tried overlaying the ellipse as a transparent worksheet on the xy scatter plot of the original chart on a dashboard. I did this very crudely (haven't used dashboards before).

Any help on streamlining this or making it look better would be greatly appreciated.

Ken Flerlage, 2020-12-17 16:22 -05:00

That's a really good question, Katie. My standard answer for this is that, if you have that much data and want to create a sankey, your best bet will be to aggregate your data. I still believe that's the best approach as sankeys are always going to aggregate your data to a certain level--you could never show all of those 5M records in a sankey. That said, I do realize that sometimes, it's easier to bring in the raw data (or you may need it for other charts). The idea of using relationships is interesting. I haven't tried this myself, but in theory, it should work. If you try it, let me know!

Katie, 2020-12-16 17:18 -05:00

Thanks for this great article and explanation, Ken! I have a follow-up question: I have a ~5M record data set for a sankey (using the "model" tab of your sankey Excel data template). Would Relationships per the new Data Model work here, instead of cross-joining, then taking an extract? I'm wrestling with extract creation problems since the data set is huge after it's densified. I'm thinking that since the join doesn't happen until viz run-time with Relationships, this might allow me to create a much smaller extract of my data, then join it to the sankey model data for densification as the sankey is being rendered. What do you think? Any other strategies you might recommend?

Thanks again!
Katie

Ken Flerlage, 2020-03-30 17:53 -04:00

No, I would not recommend applying this technique if you had millions of records in your database. There are other methods of doing data densification that do not require duplication (essentially, using the data itself for the binning mechanism), but when you're using bins and table calculations with millions of records, that will also have a huge negative impact on performance, as I noted in the "Which is Best" section. So I wouldn't necessarily recommend that option either.

So, to more directly answer your question--if you needed to perform a technique like this on millions of records, then in most cases, you would not be trying to create a single line (or in the case of a sankey, a polygon) for each record. Thus, I'd recommend that you aggregate your data ahead of time. That will reduce the records to the point where the duplication of those aggregate records would not have a severe impact on performance.

Anonymous, 2020-03-30 02:59 -04:00

Very interesting article, but (in my humble opinion) not very pragmatic. In this example (and many more that I've seen) you're increasing the data set by a factor of two. Would you still apply the same method to a data set that contains more than a million rows of data? My guess would be that the average user would experience some loss in performance on Tableau Desktop. Then spinning this up to a server for a dashboard that has several other visualizations? I love the article, I just don't see a practical application when the first step is "double your data set."

Anonymous, 2019-07-26 01:43 -04:00

Hi Ken, Thanks for the great articles!

Unknown, 2019-07-20 13:09 -04:00

Zen Master Ken, Amazing Skills Sharing. Thank you

Unknown, 2019-05-20 14:03 -04:00

Ken, this is meticulous work again. Thank you very much for sharing your knowledge.

PS: I'd never thought of seeing the sigmoid function in the context of 'imputation.'