An Introduction To Tableau Regular Expressions (REGEX)

Kevin and I are incredibly excited to have Don Wise join us today for the first of a multi-part series on regular expressions. I spend a lot of time on the forums and I’ve personally run into many questions that require complex string parsing. I can typically recognize them as problems that can be solved using regular expressions, but I’m not very skilled with them, so I often find myself calling in Don. I’m always amazed at his ability to create these expressions and explain them in a way that is easy to understand. After my last encounter with one of Don’s brilliant solutions, I asked if he’d be willing to write a blog on the topic. And, lucky for all of us, that blog turned into this amazing series.

 

Of course, Don is more than just the resident regex expert. He is a Tableau Forums Ambassador, has an incredibly deep knowledge of Tableau as a whole, and is one of the top contributors on the Tableau Community Forums. If you’ve ever asked a question on the forums, it’s quite likely that you’ve been helped by Don. Recently retired from the Public-Safety arena, Don has extensive background in Fire/EMS and 9-1-1 operations and related data at local, State and Federal level. You can get in touch with Don either on the forums or through his website, www.dcwise.net.


 

Introduction

Have you ever wondered what those Tableau functions starting with REGEXP are all about? Ever wanted to do something that might be handled with a SPLIT() function, but doesn’t exactly turn out how you expected? Or perhaps believe those REGEXP functions are a bit intimidating (sometimes they look like Egyptian hieroglyphics!!)?

 

Yeah, I’ve been there on all those occasions. So, I’m with you. As a bit of background, I’m not a programmer, never written a line of code (other than a Tableau calculation or two), I do know and understand SQL syntax and enterprise-level systems just enough to be dangerous.

 

Yes, I’m mostly a layperson with a basic understanding of data structure, data analysis, and related tools to clean and reshape data…so, this series is for all of us who’d like to understand a bit more what those functions are all about. And, when/how to use them!

 

I think one of the reasons I leaned into learning REGEX expressions is that they follow patterns. Really, that’s exactly what they dothey help pull out (extract) or match patterns in strings of data. At the beginning of my quest to learn REGEX, it all looked like a bunch of gibberish. Completely non-understandable. Now, I can read and build those hieroglyphics with more confidence and understanding and I’m going to use this blog series to help you understand and write them as well.

 

Will this series resolve all your REGEX data manipulation questions? Probably not. But it’ll be a good starting point on your road to start using them in Tableau. We’ll supply some related links and online tools along the way as well as some relevant links to Tableau’s Help sections. Most importantly, many of the examplesfrom the basic to intermediate to more advancedwill use actual solutions as found in the Tableau Community Forums.


 

History of Regular Expressions

Regular Expressions, often abbreviated as REGEX or REGEXP, have been around since the 1980’s. They were the outgrowth of Regular Language developed in the 1950’s, which was used primarily in UNIX-based systems for processing text. The syntax is slightly different from system to system, so what I’ll be sharing here is focused on Tableau.

 

The primary use case for regular expressions is extracting a particular data element from a string of data. Regular expressions are commonly used by website and application developers to validate whether the user input matches the required structure of the data element. An example that we encounter on almost a daily basis is an email address. Email addresses, for example, require a user name, followed by the @ sign, followed by a domain (user-name@domain-name.com). We could use REGEXP_MATCH to validate if the user input for that data element/field was entered properly and display an error message if it didn’t match the required format correctly.


 

How Do Regular Expressions Work?

Ok, so we know what regular expressions are, but how do they work? Well, like any “language” or format, there are components that help build the syntax or composition of an expression, much like building a calculation in Tableau using IF-THIS-THEN-THAT type syntax. When built correctly, the underlying algorithm in Tableau will translate the pattern and provide a match or extract the data. Whenever there isn’t a match to the input pattern, Tableau returns Null.

That means we need to understand what builds/supports the syntax. In other words, what do those hieroglyphics mean?

 

A quick word of warning: This section may feel a bit overwhelming as much of the terminology and the definitions will be foreign to you. But that’s normal so please do not worry too much about it. I just need to cover some of these basic building blocks before showing you how regular expressions work in practice.

 

A regular expression can contain 3 distinct pieces. Note: Not all expressions will require all three, it depends on the requirement.

 

1) Regular Expression Metacharacters ClassesA Metacharacter is a character that has special meaning to a computer program, such as a shell interpreter, or in our case, a regular expression (REGEX) engine.

 

2) Regular Expression Operators/QuantifiersThese are used to refine the pattern.  For example, how many times should the pattern match be repeated.

 

3) Set Expressions (Character Classes)These are used to help define specific characters that we want to match.

 

Regular Expression Metacharacters Classes

In regular expressions, using the backslash with certain letters causes certain characters to do something. A capitalized letter will have different meaning and action than that of a lower case:

 

Character

Description

\b

Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.

\B

Match if the current position is not a word boundary.

\d

Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)

\D

Match any character that is not a decimal digit.

\w

Match a word character.

\W

Match a non-word character.

.

Match any character.

^

Match at the beginning of a line.

$

Match at the end of a line.

\

Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ .

\

Quotes the following character. Characters that must be quoted to be treated as literals are [ ] \ Characters that may need to be quoted, depending on the context are - &

 

Regular Expression Operators/Quantifiers

The function of operators/quantifiers is different than their literal meanings. For example, the + sign would mean addition to us normally, but in regular expressions, it can mean do somethingin this case, perform one or more repetitions of a match. Below is a non-exhaustive list:

 

Operator

Description

|

Alternation. A|B matches either A or B.

*

Match 0 or more times. Match as many times as possible.

+

Match 1 or more times. Match as many times as possible.

?

Match zero or one times. Prefer one.

{n}

Match exactly n times

{n,}

Match at least n times. Match as many times as possible.

{n,m}

Match between n and m times. Match as many times as possible, but not more than m.

*?

Match 0 or more times. Match as few times as possible.

+?

Match 1 or more times. Match as few times as possible.

??

Match zero or one times. Prefer zero.

{n}?

Match exactly n times.

{n,}?

Match at least n times, but no more than required for an overall pattern match.

{n,m}?

Match between n and m times. Match as few times as possible, but not less than n.

 

Set Expressions (Character Classes)

 

Example

Description

[abc]

Match any of the characters a, b or c.

[^abc]

Negation - match any character except a, b, or c.

[A-M]

Range - match any character from A to M.

 

As an example, the below graphic breaks out the syntax/schema of all three pieces which can be used to validate or match on a Social Security Number:

 

 

OK, that’s a bit to take in and understand, for now. But before going on to the next section, please take a moment to really look at those symbols and what they might possibly do with a set of data. Remember, regular expressions are quite literal/explicit until we add in certain operators, metacharacters, or set expressions.

 

Note: If you’d like to see all three tables, they’re based on the International Components for Unicode (ICU) Regular Expression libraries.


 

Regular Expressions in Tableau

Now that you have an understanding of the history of regular expressions and their basic building blocks, let’s talk about how to create them in Tableau. Tableau has four different REGEXP functions, which either match, extract, or replace strings within a larger string of data. If that doesn’t make sense, be patient as I’ll be sharing some examples shortly.

 

REGEXP_EXTRACT (string, pattern) – Looks for a particular pattern within a string or substring and extracts that data element.

 

REGEXP_EXTRACT_NTH (string, pattern, index) – Looks for a particular pattern within a string or substring, starting at the nth position in the string, and extracts that data element.

 

REGEXP_MATCH (string, pattern) – Looks for a particular pattern within a string or substring and returns TRUE if there’s an exact match.

 

REGEXP_REPLACE (string, pattern, replacement) – Looks for a particular pattern within a string or substring and replaces that data element with a different data element.

 

Tableau regular expressions can also be “nested”, meaning, you can use a REGEXP_REPLACE() function within another REGEXP_REPLACE() function, much like you can do currently with a regular REPLACE() function. Nesting allows you to make multiple passes over the data so that you do not need to cram all your logic into one expression. But let’s not get ahead of ourselves…or too overwhelmed! 


 

Tableau Examples

Let’s now look at some regular expressions examples. We’re going to start off with a few relatively straightforward examples, in order to ease into it.

 

We’re going to start out with some simple examples that then get progressively more complex as we move forward. I use the term “simple,” but for those who are new to regular expressions, these may seem anything but simple. Please don’t get overwhelmed as you work through thesethey can be tricky to get your head around, but with exposure, practice, and time, you’ll become a regular expressions master!

 

Remember the example tables of Metacharacters, Operators, and Sets?  We’ll be working with those various symbols from those tables to support our pattern matching and extracting efforts. To keep things basic, much like the Social Security Number pattern matching, we’ll use a simplified method which draws on all three of those tables for our first example.

 

For this example, let’s use the following simple set of student course data:

 

Text

Smith, Paul is a student of English, Student ID: ABC123

Jones, Mary is a student of Math, Student ID: DEF345

Lee, Sally is a student of Economics, Student ID: GHI678

Brown, Sam is a student of Music, Student ID: ABC345

Black, Kelley is a student of Business, Student ID: ABC123

 

Just copy those lines into Excel or Google Sheets, then use either Tableau Desktop or Tableau Prep to connect to your spreadsheet.

 

We will establish just 2 goals for matching on this data:

 

1) Get the person’s last name.

2) Get the person’s first name.

 

We’re going to use REGEXP_EXTRACT to extract each of these different components, using the following syntax:

 

REGEXP_EXTRACT([Text], '( )')

 

After the first comma we’re placing a beginning parenthesis and closing parenthesis within quotes. Why? Well, the quotes are required syntax, but the parenthesis is a different thing altogether. It took me a while to understand that, in almost every circumstance, a set of enclosing parentheses are needed to create what’s known as a “capture group”. In other words, what specific values are we looking to get out of the pattern? Tableau really, really likes capture groups. So, I’ll always start off with that initial syntax, then build my patternthose hieroglyphics in and/or around that set of parentheses.

 

Last Name

Let’s work on the first problem statement of getting the person’s last name. In our data, the last name is always the first “word” in the text, so the simplest syntax to capture group and extract that out would be:  

 

REGEXP_EXTRACT([Text], '(\w+)')

 

So, what’s going on here really?  We know that the parenthesis are our ‘capture group’, the data we’re seeking to extract or match upon.  That initial backslash is considered our lead-in or starting point of what we’re trying to grab.  The backslash by itself would mean start here or rather start with this next character. A small ‘w’ = match on a word character.  It’ll also match on alphanumeric values.  Using the backslash in REGEXP in combination with a small ‘w’ forces that small w to no longer be a literal letter ‘w’ but to become an action to do something. When followed by the + sign, it causes a match of all ensuing characters, but just once. Meaning, it’ll stop the match process when it encounters the next character that is not a word (alphanumeric) character. In Tableau, that syntax will return the following:

 

 

It’s exactly what we want out of this set of data. Notice that it stops at the comma as the REGEXP expression is looking for a whole ‘word’ in this case and is not looking for any additional characters or spaces. If you were to leave off the + sign, using just \w then you’d have the following match which is correct because it is a singular “word character” match:

 

 

See, that was easy!

 

First Name

What about the next problem statement/goal? To get their first name? It’s located after a space and a comma. Those will be our “anchors” off which we’ll work. We’ll use the following syntax:

 

REGEXP_EXTRACT([Text], ',\s+(\w+)')

 


For this situation, we’re stepping outside of our “capture group”. Why? Because we need to tell the regular expression to start at a specific point in the textin our case, we need to look forward from the comma and the space. Let’s break this syntax down into its component parts.

 

, – This is just a literal string. We’re searching for a comma.

 

\s – This metacharacter means “space”. So we’re searching for a space. Combined with the comma, we’re now looking for a comma followed by a space.

 

+ – This means we’re going to grab the word capture group right after the match of the command and space.

 

Could the \s+ be written as just \s? Yes, it could but I’m a double-space person so the + will cause the expression to check for any additional spaces between that comma and the capture group—just in case.


 

Coming Soon

I hope this provided a good, Tableau-focused introduction to regular expressions. But we’re not done yet! In the next blog in this series, I’ll share some more practice examples that I see regularly in my work and on the Tableau Community Forums.


6 comments:

  1. Thanks for sharing. Looking forward to a blog on backreferencing and looping/walking greedily thru a list of strings just like this. Also non-capture groups are a bit confusing for me to use in Tableau. Keep up the good work

    ReplyDelete
  2. This is a great post, thank you

    ReplyDelete
  3. Awesome post, as usual. I might add in the next post, how to test a RegEx function in regexr.com, and how to past the expression in a Tableau RegEx function.
    Cheerio,
    Franco

    ReplyDelete
  4. Thanks thanks thanks!

    ReplyDelete
  5. This is awesome. thanks

    ReplyDelete

Powered by Blogger.