The Question is the Question


Big Data anyone? Hadoop? NoSQL? NewSQL? SomeSQL? AnySQL? AllSQL? <Insert New Over-Hyped Technology Here>?


It seems like all we hear about these days are new classifications of data and new tools and techniques to work with that data: Big Data, Hadoop, NoSQL, In-Memory Analytics, Real-Time Analytics, Machine Learning, Internet of Things, Artificial Intelligence, Predictive Analytics, Prescriptive Analytics, and so on.

Software companies are all too eager to sell us new technologies to address the next big thing and we keep biting. A hardware/software vendor (who shall go unnamed) once asked me, “Are you doing Hadoop?” I found the question a bit jarring, in all honesty. To be fair, it was an understandable question considering the conversation we were having and the fact that they probably wanted to sell me something else, but it occurred to me that we spend an awful lot of time talking about what technologies and techniques we use.

What’s the Question?
The fact is that questions like “Are you doing Hadoop?” show our propensity to ask the wrong types of questions. The question we should be asking is “What is the question?” After all, the reason for analytics, data science, and related disciplines should be to answer questions. Without a good question, the data, technologies, and techniques we use are pointless. Ask any data scientist and they will tell you that the most important thing is the question. So, before we start any data or analytics project, we first need to determine what question(s) we’re trying to answer.

Of course, questions need answers, so once you know the question, you can find the right data, technology, statistical analyses, etc. to answer it. Perhaps that will mean that you need to acquire some set of “big data”; perhaps you’ll need to use Hadoop or leverage an in-memory analytics appliance; perhaps you’ll need to write some R code; or maybe you just need to do some simple data manipulation in Excel! The tools, technologies, and techniques are important, but only because they enable you to answer a question. And for simple questions, we can often get by with simple technologies and techniques.

This is where analytics programs need to start. Generally speaking, you have to answer the simple questions before you can start answering complex ones. I’d actually argue that it’s pretty hard to even think of the complex questions without first asking the simple ones, which leads me to my next point.

The “Endless Why”
We were all once children and I’m willing to bet that most of us, at one point, drove our parents crazy with the “endless why.” For example:

Dad: Son, eat your broccoli.

Son: Why?

Dad: Because it’ll make you grow strong.

Son: Why?

Dad: Because broccoli contains healthy nutrients.

Son: Why?

And so on…

This is pretty normal behavior for a toddler (although, with my own kids, I often wondered if they actually wanted to know or were just trying to drive my wife and me crazy). But, let’s assume that we are all just naturally inquisitive as children. The problem is that we lose this as we get older and we forget how to ask questions.

In analytics, we need more of the endless why. That’s what’s so great about data: when you ask a good question, you can use the data and tools available to find a good answer, but it doesn’t have to stop there. Good answers often breed better questions, creating a cycle of what a fellow analytics professional called “better answers to better questions.” As data and analytics professionals, I think this should be our goal: to constantly and iteratively seek better answers to better questions. We’ll use various technologies along the way, but ultimately, our goal should be to answer a question. Without that, what’s the point?

Ken Flerlage, October 25, 2015




