Tackle Data Dictionaries with Tableau Prep

Many business systems have adopted field naming conventions in order to ensure consistency across fields. For example, I used to work with JD Edwards. One of the most common tables is F4211, the Sales Order Detail table. The table has over 250 columns. Here are the first 10:


SDKCOO
SDDOCO
SDDCTO
SDLNID
SDSFXO
SDMCU
SDCO
SDOKCO
SDOORN
SDOCTO

Each column name is prefixed with “SD” (for “Sales Detail”) and followed by a Data Dictionary key. For example, KCOO stands for Order Company, DOCO is the Document Number, DCTO is the Document Type, and LNID is the Line Number.

Every table in the database has this structure. The reason for this design is that it makes it easier for developers to ensure consistency across the entire database. For example, Line Number will always use the LNID data dictionary item, regardless of what table it’s in. 

Systems that leverage such data dictionaries also commonly have other strange constraints. Some may store numbers without the decimal points (as implied decimals), some may use different formats for dates, etc.

Of course, JD Edwards is just one example. Most big systems tend to use data dictionaries—as do many smaller systems—since it makes managing the database much easier. Unfortunately for us, it makes analytics really difficult as you need to change the column names to more logical, business-focused names and often need to perform various transformations on the data in order to address date formats or implied decimals, for example.

The best way to address this, in my opinion, is by building a data warehouse. In your data warehouse, you can give the fields logical names and perform all the various transformations. However, not every organization has made an investment in a data warehouse and even those who have may find themselves in need of a direct database connection from time to time. So, in this blog, I’m going to talk about some of the challenges of using data controlled by a data dictionary. Then I’ll show you how to use Tableau Prep to transform your data into a more usable format in an automated way.

Data Dictionaries
In this example, I’m going to work with Superstore data. But, I’ve modified the data to be structured according to a data dictionary. Here’s the dictionary for my table:


And here’s a snippet of the Orders table converted to this structure.


In this case, my fields are all prefixed with “OR” (for “Orders”), followed by my four-digit data dictionary item.

---------------------------------------
Note: In order to make it easier for you to work with this data, I’ve created both the Orders table and the Dictionary table in my publicly-accessible SQL Server database, which you can access using the following details:

Server Name: 3.143.125.139
Authentication: SQL Server Authentication
Login: SQL
Password: SQL

The Orders table is called Orders with Dictionary and the dictionary table is called Dictionary.
---------------------------------------

While we could certainly work with this directly in Tableau, it presents us with a few challenges. First, the field names are basically just codes. In order to make this more usable to us as developers and, more importantly, our end users, we will want to rename the columns to use business terminology. That will require us manually looking up each data dictionary item and then providing a better name.

We’ll also need to address any differences in the way fields are stored. For example, ORORDT (Order Date) is stored in a normal data format, while ORSHDT (Ship Date) is stored in MMDDYYYY format and ORIVDT (Invoice Date) is stored in YYYYMMDD format. And do you see anything strange with ORSAMT (Sales Amount)? On the surface, it just looks like a normal number, right? But, if you look at the dictionary, you’ll see that it actually uses 2 implied decimals, which means that a number stored as 66588 is actually 665.88. I’ll tell you right now that you don’t want to be the person who overstates sales by 100 times!! So, to make these adjustments, you’ll need to manually refer to the data dictionary then create calculated fields to transform these fields into the correct values.

While renaming fields and performing the various transformations on data types may not be a huge deal for a small table like this one, it can be an absolute nightmare for larger tables. Imagine manually adjusting all of the 250+ columns in the F4211 table I mentioned earlier!!

Tableau Prep
So, is it possible to create an automated way of transforming your data based on a data dictionary? Yes—we can do it with Tableau Prep! I’m going to take you through the process step by step.

First, we’ll connect to our Orders table. Next, in order for this process to work, we’re going to need a field that acts as a unique identifier. If your table already has a unique ID then you can use it. If your table does not have a single unique ID field, you can use a combination of fields that defines a unique key (though it’s easier to create a single field as I’ll show you momentarily). If there is no unique key, then you can analyze your data to see if you can find some sort of combination of fields that is unique—if that’s not possible, then this method won’t work…sorry!

My orders table does have a unique ID field already, ORROID (Row ID), so I could use that, but I’d like to show you how to create an artificial ID using Tableau Prep’s new analytical function, ROW_NUMBER. Again, this doesn’t really make a lot of sense since I already have Row ID, but let’s create one anyway. To do this, I’ll create a calculated field like this:

{ ORDERBY [ORORDT] ASC, [ORORID] ASC: ROW_NUMBER()}

This will essentially sort the data by ORORDT (Order Date) then by ORORID (Order ID), in ascending order, then create a numeric row number. Note: You can order the data however you like using ORDERBY and you can also partition the data in order to restart the row numbering. 

Note: The above is only available in version 2020.1.3 and later. For more details on the above syntax, see Create LOD and Rank Calculations.


Through the rest of our steps, we’ll use ID as a unique identifier, which will allow us to pivot and unpivot the data, while keeping the data together in a single record.

Our next step will be to pivot our data. We’re going to pivot every field except ID so that we can then join to the data dictionary.



Note: You may have noticed above that numeric values look a bit strange after the conversion to text. For example -0.64 becomes -0.64000000000000001. This has to do with the fact that they are stored as floating point numbers. Floating point numbers do not maintain exact precision, which means that, internally, they are not stored as exactly the number they mean to represent. In most cases, this should have little to no impact on you. However, you should be careful in situations where a high level of precision is required. If, for example, you need precision to 10+ decimal points, then this method of transforming your data may increase the risk of slight variances and should probably be avoided.

Now we’ll add a clean step and do a bit of cleanup. First, I’ll rename Pivot1 Values to Values. Then, I’ll create a new field called Fields which will remove the two-digit prefix from the field names:

// Remove two-digit prefix from field names.
MID([Pivot1 Names],3)

Finally, I’ll remove Pivot1 Names:


Now that we have fields with just the data dictionary item in them, we can join to the Dictionary table on Fields = Dictionary.


We now have all the details for each data dictionary item and can use them in our transformations. The first thing we’ll do is perform the date and implied decimal conversions. The number conversions should be fairly easy as we’d just divide the number by 10d where d is the number of decimals. But, when I tried this, Tableau Prep gave me a collation error. This seems to be due to my converting a string to a number then back to a string. I’ve reported this to Tableau so they can correct it in a future version. But, for now, we’ll need to use a method that avoids conversion to numbers. Unfortunately, this a little more tricky. I’ll do it in two steps. First, I create a calculated field called Values - Decimal Step 1:

// Step 1: Pad with enough zeros so we can move the decimal.
IF [Type]="Decimal" THEN
     // Deal with implied decimals.
    IF LEFT([Values],1)="-" THEN
         "-00000000" + REPLACE([Values],"-","")
    ELSE
         "00000000" + [Values]
    END
END

We’re padding the number with leading zeroes just to make sure we have enough space to move the decimal. Now, we’ll finish the transformation by creating another calculated field, Values – Decimal:

// Step 2. Move the decimal point.
IF [Type]="Decimal" AND [Implied Decimals]>0 THEN
     LEFT([Values - Decimal Step 1],LEN([Values - Decimal Step 1])-[Implied Decimals]) + "." + RIGHT([Values],[Implied Decimals])
ELSE
     [Values - Decimal Step 1]
END

This essentially just breaks the string at the decimal point, moves it to the left by two spaces, then reconstructs the string.

For the dates, we’ll use the following calculated field, Value – Dates, to convert from the internal formats. Note: I had the same collation issue when I converted these to dates then back to strings, so this calculation does everything with strings only.

// Put dates in YYYY/MM/DD format.
IF [Type]="Date" THEN
     CASE [Date Type]
     WHEN "MMDDYYYY" THEN RIGHT([Values],4) + "/" + LEFT([Values],2) + "/" + MID([Values],3,2)
     WHEN "YYYYMMDD" THEN LEFT([Values],4) + "/" + MID([Values],5,2) + "/" + RIGHT([Values],2)
    ELSE [Values]
    END
END

For dates that are not stored as dates, this will convert them to YYYY/MM/DD which can easily be converted to a date by Tableau. For those fields already in a date format, we’ll just leave the field as is.

Finally, we’ll create another field, Values – Other, for the remaining data types that do not require any transformation.

// Values for other fields.
IF [Type]<>"Date" AND [Type]<>"Decimal" THEN
     [Values]
END

We now have the following:


In another clean step, we’ll merge these three values fields into a single field.


This field will take the name of the first field. In my cases, this is Values – Other. So, I’ll rename this to Values – Final. We can then remove all the other Values fields.


Now that we’ve done all of our transformations, the only fields we need are ID, Description (will become our new field name), and Values – Final, so we’ll remove all the other fields in a new clean step.


If you’ve followed along this far and are confused at how we’re going to get this back into a columnar format, this next step is where it should all come together. We’re going to add in a pivot step then pivot all of our rows back to columns using Description as the new field name. In the pivot step, we’ll change it to a “Rows to Columns” pivot, then drag Description to Pivoted Fields. Tableau Prep will then enumerate all of the values in this field so it can convert them to field names. We’ll also drag Values – Final to Field to aggregate for new columns. By default, Tableau Prep will use a count aggregation. Change this to MIN or MAX. While this may seem like you’re going to eliminate some values, it won’t. This is because we created that unique ID in the first step and the aggregation will always operate at that level of detail. Our data now looks like this:


Our columns now have nice names and the data has been converted from internal formats to formats that can be recognized by Tableau!!

Warning: It was pointed out to my by Jonathan Drummey that pivoting rows to columns is not truly dynamic, meaning that as new row values are added, the pivot will not pick them up automatically and create new columns. For this blog, that simply means that columns that are added to the table will not be picked up automatically. If new columns are added, you'll need to recreate the rows to columns pivot step. Otherwise, this should have limited impact. Note: If you'd like to see the rows to columns pivot be dynamic, please up-vote the following: rep Rows to Columns: dynamically add and/or remove pivoted values based on the current data


Our last step is to convert the data types as all of our fields are currently strings. There are a couple of options here. One of the things I dislike about file-based data sources (Excel, csv, etc.) is that they do not enforce data types. This means that Tableau has to guess the data types. But, in this case, we can use this to our advantage. If you output the data to csv, then connect to the csv in Tableau Desktop, it will guess at the data types and, because we’ve formatted numbers and dates in a certain way, Tableau should convert them quite well as shown below:


But, in many cases, an extract may be more desirable. In that case, my recommendation is to add a prefix to the Description field upstream in the Prep flow, which will indicate the data type.

// Prefix the field name with an indicator of the data type in order to make it easier to change data types at the end.
LEFT(LOWER([Type]),3) + " " + [Description]

Then, before you output the file to hyper, manually change all the data types. Obviously, this is still a manual process, but you’ll only need to change those fields that are not strings (numbers, dates, etc.) and they will be easily identifiable due to the prefix added above. The result looks pretty much the same as the CSV output.

The Final Product
Finally, just to help you see the differences, here's a before and after of the original data vs the final output.






So, there you have it. By pivoting our columns to rows, joining to the dictionary, performing various transformations, then pivoting the rows back to columns, we can create a clean data set in a mostly automated fashion. That said, this method isn’t perfect. Our data set was relatively small (about 10,000 records), but the process of pivoting every column of a large data set could put a lot of pressure on Tableau Prep. Imagine if our F4211 table had 1 million records. When we pivot its 250+ columns to rows, that will result in 250+ million rows! But, in many cases, this should be a viable approach. And, once you create the basic structure of your flow, you should be able to easily adapt it for other tables in your data source.

If interested in seeing my workflow, I’ve made the CSV and Extract flows available here: Packaged Workflows


Ken Flerlage, February 10, 2020

4 comments:

  1. Ken, thank you for giving more details on unique identifier. I've been looking for this one for some time. Basically, now I can get rid of my manual calcualted fields retreiving 'Unique Value' in Excel.

    ReplyDelete
    Replies
    1. Yes, this is a great new feature. Opens up so many new possibilities!

      Delete
  2. I am trying to adapt a data dictionary to update description and values which are often numeric codes for categorical variables. Is this crazy talk? I primarily use giant Census datasets...

    ReplyDelete
    Replies
    1. Yeah, that's a common scenario that I intended to help with this blog. It should work, I think. But if the data sets are huge, that could be a bit problematic since it's going to pivot all those columns to rows first. This sounds like an interesting use case, so I'd be happy to take a close look if you'd like to email me directly. flerlagekr@gmail.com

      Delete

Powered by Blogger.