Big Data Strategy (Part I): tips for analyzing your data
In a previous post we looked at the common misconceptions in big data analytics, and at how important it is to start looking at data with a goal in mind.
Although I personally believe that posing the right question is 50% of what a good data scientist should do, there are alternative approaches. The one most often suggested, in particular by non-technical professionals, is the “let the data speak” approach: a sort of magic, random data discovery that is supposed to spot valuable insights a human analyst would not notice.
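To get a feel for what undirected discovery can turn up, here is a minimal sketch in Python. It is an illustration only, not a method from the cited authors: it generates a target and 200 candidate variables that are all pure noise, then counts how many candidates correlate with the target strongly enough to look "significant". The function names, the sample sizes and the cutoff of 0.279 (roughly the two-sided 5% critical value of the Pearson correlation for 50 observations) are my own assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(n_obs=50, n_features=200, r_crit=0.279, seed=0):
    """Count noise variables whose correlation with a noise target exceeds r_crit.

    r_crit=0.279 is an assumed, approximate 5% two-sided cutoff for n_obs=50.
    """
    rng = random.Random(seed)
    # The target is random noise: there is nothing real to discover.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_features):
        # Each candidate variable is also independent random noise.
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(feature, target)) > r_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious_hits()
    print(f"{hits} of 200 pure-noise variables look 'significant'")
```

With a 5% cutoff, around one in twenty of the noise variables should pass the test on average, even though none of them carries any information about the target. Scanning more variables without a hypothesis simply produces more of these accidental "insights".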
Well, the reality is that this is a highly inefficient method: (random) data mining is resource-consuming and potentially value-destructive. The main reason why data mining is often ineffective is that it is undertaken without any rationale, and this leads to common mistakes such as false positives; over-fitting; overlooked spurious relations; sampling biases; causation-correlation reversal; inclusion of the wrong variables; or model selection errors (Doornik and Hendry, 2015; Harford, 2014). We should especially…