Ads Top

Regex Part II: Basic Examples & Exercises

We’re excited to have Don Wise back for part 2 in his series on regular expressions. Don is a Tableau Forums Ambassador, has an incredibly deep knowledge of Tableau as a whole, and is one of the top contributors on the Tableau Community Forums. If you’ve ever asked a question on the forums, it’s quite likely that you’ve been helped by Don. Recently retired from the Public-Safety arena, Don has extensive background in Fire/EMS and 9-1-1 operations and related data at local, State and Federal level. You can get in touch with Don either on the forums or through his website, www.dcwise.net.

 

 

Hey! Welcome to our next segment on Regular Expressions. If you’ve made it past that first section, An Introduction to Tableau Regular Expressions (REGEX), then you’ve decided to take the plunge and learn a bit more with some practical examples. If you haven’t read the first blog and are not yet familiar with the basic components of regular expressions, then I highly recommend going back and reading it as we’ll be building upon that knowledge here.

 


Examples

We’re going to continue on with some additional examples of specific pattern matching using the same set of data as in the first blog. We’re using the following simple set of student course data:

 

Text

Smith, Paul is a student of English, Student ID: ABC123

Jones, Mary is a student of Math, Student ID: DEF345

Lee, Sally is a student of Economics, Student ID: GHI678

Brown, Sam is a student of Music, Student ID: ABC345

Black, Kelley is a student of Business, Student ID: ABC123

 

Just copy those lines into Excel or Google Sheets, then use either Tableau Desktop or Tableau Prep to connect to your spreadsheet. 

 

We will establish 4 goals for cleaning this data:

 

1) Get the person’s first and last name.

2) Get the person’s Student ID.

3) Get the person’s Course.

4) Determine if any of the ID’s are duplicates.

 

First and Last Name

Now we need both their first name and last name together. We could just concatenate the first name and last name that we’ve already extracted in the first post, but that would be taking the easy route—let’s do it with regular expressions instead. We’re going to work with something slightly different from the prior examples:

 

REGEXP_EXTRACT([Text], '([a-zA-Z]+,\s+[a-zA-Z]+)')

 

 

Oh wow—what in the world is this???!? We’re actually using what’s known as a “set expression” with a range of values that can be matched when they are enclosed in brackets [ ]. So, the first bracket group is matching on anything that is a letter whether lower case (a-z) or upper (A-Z). Of course, we already know that the + sign continues the pattern match on any additional word characters thereafter, so that part of the REGEXP matches on “Black” for example.

 

Thereafter, we use the comma which separates the last name and first name followed by our search/match on any white space \s+. Then we introduce the same set expression again followed by the + sign to get the first name. Which matches the entire string (last name, first name). Not nearly as difficult as it looks, is it?

 

Could we have done this from the beginning when we wanted just our last name out of the string? We sure could! As in Tableau, there are multiple ways of getting to the same or similar result.

 

Student ID

Alright, we now need the students’ ID. Tricky? Maybe! Let’s use the following syntax: 

 

REGEXP_EXTRACT([Text], ':\s+(\w+)')

 

 

Looks pretty similar to what we did earlier with getting the person’s first name, doesn’t it? Knowing what you now know, take a few minutes to figure out what this is doing? Just take your time and break down each of the components one-by-one.

 

Do you have it? We know the ID appears after the colon : so that’s the target start of our expression, followed by our standard white space match using \s+ and, of course our capture group of (\w+)where the + sign expands our match to the end of the “word”. 

 

Notice that \w+ will also capture numeric characters. What if only wanted to capture the numeric digits of the ID? Well, in this instance, because the text does not contain any other numeric digits, we’d simply use the following:

 

REGEXP_EXTRACT([Text], '(\d+)')

 

 

The \d will match any digit from 0-9. And the + sign does what? Correct—it expands our match for any subsequent character matches.

 

Course

We need to get the person’s course of study for our next challenge. This syntax is a bit different from the rest and somewhat more complex. We’ll use the following: 

 

REGEXP_EXTRACT([Text], ',.*?(\w+),')

 

 

What-the-what? Here we’re working outside of our capture group again. We need to tell the regular expression that we want to start somewhere OTHER than exclusively the first comma. In fact, we’re using two commas to help the pattern match—we start and end with a comma. So, a bit of a “between” match here. OK, but how does that get us to the word matched?

 

Let’s figure out what’s going on after the first comma. We’re introducing three new hieroglyphs here: a period . an asterisk * and a question mark ?, followed by our (\w+) capture group. This particular instance of symbols are used in combination with each other.  Let’s break this down into its components.

 

. – Matches ANY character.

 

*? Matches on the previous token between zero and an unlimited times, but as few as necessary, expanding only as needed. This is what’s known in regular expressions as a “lazy match.” In other words, match the word characters (capture group) found nearest the last comma.

 

That was a tricky one, but spend a bit of time with it and I think it will start to become clear to you.

 

Duplicate IDs

Alright, we’re on to our last requirement! We need to determine which of our students have had their ID’s duplicated. In this case, our Student Affairs department has advised us that they knew some of the Student ID’s containing ABC were duplicate, but they were unsure of how many. Unfortunately, this kind of thing happens—it shouldn’t but it does. We’ll use the following: 

 

REGEXP_MATCH([Text], 'ABC')

 

 

Using the REGEXP_MATCH() function returns just TRUE or FALSE—we either found a match or we did not. In this example, there is no need for a capture group as it’s a simple search on the value, ABC.

 

Of course, the match is on any value that contains ‘ABC’. That could be a good starting point for determining the issue. Thereafter, filtering to just TRUE, or out of those that return TRUE, simply defining the match to the following would return the two that match explicitly.

 

REGEXP_MATCH([Text], 'ABC123')

 

 

Of course, we’re not forced to only work with regular expressions here. We can combine other Tableau functions to get the result we’re looking for. In this case, we could use LOOKUP to check the previous value or the next value to see if they are a match.

 

IF ATTR([Student ID]) = LOOKUP(ATTR([Student ID]), -1) OR

   ATTR([Student ID]) = LOOKUP(ATTR([Student ID]), 1) THEN 

    TRUE 

ELSE 

    FALSE

END

 

 

Note: We’d need to make sure our table is sorted by ID for this to work.

 


Wrap-Up

We did it!! That was quite a bit to take in for one sitting, but hopefully you’re now a bit more interested in how regular expressions work. And hopefully you have a better understanding as to how they can be helpful in parsing data in ways that cannot be easily done using other Tableau functions. 

 

I'll be back at some point in the future with with some of the more interesting data challenges found in the Community Forums as real-life examples of issues resolved with regular expressions. In the meantime, if you’d like to learn more or practice, here are a few resources that I particularly like:

 

RegEx 101: Online Tester 

RegExr: Online Tester

RegEx: Online Tester

RegExOne: Interactive Tutorial

Learn-RegEx: Cool Interactive Method

Sitepoint Learn RegEx: A Beginners Guide

 

No comments:

Powered by Blogger.