I recently attended the Big Data Innovation Summit in London, an event co-hosted with the CTO Summit, the Big Data & Analytics for Pharma Summit, and the Predictive Analytics Summit. It followed a standard event format, with very few panels and several half-hour talks, but it was very good in terms of the variety of topics covered.
As usual, I will now try to summarize the three personal takeaways I distilled from the conference.
I. Big Data capability is still a (hard) enterprise choice
Big data capabilities are increasingly becoming relevant to different industries and sectors, but are not widely understood yet. Implementing a big data strategy often comes with a series of issues, starting with the lack of strong internal data policies.
Data need to be consistently aggregated from different sources of information and integrated with other systems and platforms; common reporting standards should be created — the so-called master copy — and any information should be validated to assess its accuracy and completeness.
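To make the idea concrete, here is a minimal sketch of building a master copy from multiple sources and validating it for completeness. The source names, field names, and validation rule are all hypothetical, chosen only for illustration:

```python
# Illustrative sketch: merge customer records from hypothetical sources
# into a single master copy, then validate required fields are present.
REQUIRED_FIELDS = {"customer_id", "name", "email"}

def build_master_copy(*sources):
    """Merge record lists keyed by customer_id; later sources fill gaps,
    but the first non-missing value for each field wins."""
    master = {}
    for source in sources:
        for record in source:
            entry = master.setdefault(record["customer_id"], {})
            for field, value in record.items():
                entry.setdefault(field, value)
    return master

def validate(master):
    """Return the ids of records missing any required field."""
    return [cid for cid, rec in master.items()
            if not REQUIRED_FIELDS <= rec.keys()]
```

A record that appears in several systems (say, CRM and billing) is merged into one entry, and validation then flags every entry that is still incomplete — exactly the accuracy-and-completeness check described above, in miniature.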
Having solid internal data management, together with a well-designed golden record, helps to solve the huge issue of stratified entrance: dysfunctional datasets that result from different people augmenting the data at different moments or across different layers.
However, data collection and aggregation is only one of the steps to be undertaken, and many organizations are finding it useful to adopt lean approaches to big data development.
The relevance of the following framework lies in avoiding two opposite extremes: collecting all the data, or collecting none at all. The next figure illustrates the key steps of this lean approach to big data.
First of all, business processes have to be identified, as well as the analytical framework that should be used.
These two consecutive stages (business process definition and analytical framework identification) have a feedback loop between them, and the same is true for the analytical framework identification and the dataset construction. This phase has to consider all the types of data, namely data at rest (static and inactively stored in a database); data in motion (transiently stored in temporary memory); and data in use (constantly updated and stored in a database).
The modeling step embeds the validation as well, while the process ends with the scalability implementation and the measurement. A feedback mechanism should prevent internal stasis, feeding the business process with the outcomes of the analysis instead of continuously improving the model without any business response.
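Read as pseudocode, the lean loop above might be sketched as follows. The stage functions are hypothetical placeholders supplied by the caller; the point is the final feedback step, in which measured outcomes reshape the business process itself rather than just refining the model:

```python
# Illustrative sketch of the lean big data loop described above.
def lean_big_data_cycle(process, identify, build, fit, measure, update,
                        iterations=3):
    """Run the lean loop: each stage feeds the next, and the measured
    outcome feeds back into the business process (avoiding 'internal
    stasis', where only the model keeps improving)."""
    for _ in range(iterations):
        framework = identify(process)      # analytical framework identification
        dataset = build(framework)         # data at rest / in motion / in use
        model = fit(framework, dataset)    # modeling, validation embedded
        outcome = measure(model)           # scalability implementation + measurement
        process = update(process, outcome) # feedback into the business process
    return process
```

Passing the stages in as functions mirrors the feedback loops in the figure: any stage can be swapped or revisited without breaking the overall cycle.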
Those frameworks and tools do not, however, guarantee the success of big data initiatives. There are indeed some common mistakes made by companies trying to implement data science projects:
- Lack of business objectives and correct problem framing;
- Absence of scalability, or project not correctly sized;
- Absence of C-level or high management sponsorship;
- Excessive costs and time, especially when people with wrong skill sets are selected (which is more common than you think);
- Incorrect management of expectations and metrics;
- Internal barriers (e.g., data silos, poor inter-team communication, infrastructure problems, etc.);
- Treating the work as a one-time project rather than a continuous learning process;
- Neglecting data governance, privacy, and protection.
II. Data Science is all about Talent
It does not matter how good your algorithms are or how many different silos of data you have on a single customer, the success of a data science project is still highly dependent on the quality of the team working on it.
In reality, data scientists as imagined by most do not exist, because it is a completely new role, especially at the more junior levels of seniority. However, the proliferation of boot camps and structured university programs on the one hand, and companies’ increased awareness of this field on the other, will drive the job market towards its demand-supply equilibrium: firms will understand what they actually need in terms of skills, and talents will eventually be able to provide those (verified) required abilities.
It is then necessary, for the moment, to outline this new role, which is still half scientist and half designer and, like the mythological chimera, combines a series of different skills and capabilities. An ideal profile is provided in the following figure; it basically merges five different job roles into one: the computer scientist, the businessman, the statistician, the communicator, and the domain expert.
However, identifying the right set of skills (see the full list of skills here) is not enough. First of all, data science is a team effort, not a solo sport. It is important to hire different figures as part of a bigger team, rather than hiring exclusively for individual abilities.
‘Data science is a team effort’
Moreover, if a data science team is a company priority, data scientists have to be hired to stay, and not simply on a project basis, because managing big data is a marathon, not a 100-metre sprint.
Data scientists have two DNAs
Second, data scientists come with two different DNAs: the scientific one and the creative one. For this reason, they should be left free to learn and continuously study on one hand (the science side), and to create, experiment, and fail on the other (the creative side). They will never grow systematically and at a fixed pace, but they will do so organically, based on their inclinations and multi-faceted nature. It is recommended to leave them some spare time to follow their ‘scientific inspiration’.
Please also consider this food for thought: “not all data scientists are created equal”.
Big Money is not all that matters
Finally, they need to be incentivized with something more than just big money. The retention power of a good salary is indeed quite low compared with interesting daily challenges, relevant and impactful problems to solve, and being part of a bigger scientific community (i.e., being able to work with peers and publish their research).
III. Data Privacy: don’t do unto others what you don’t want others to do unto you
There are two important concepts to be considered from a data protection point of view: fairness and minimization.
Fairness concerns how data are obtained, and the transparency needed from organizations that are collecting them, especially about their future potential uses.
Data minimization regards instead the ability to gather the right amount of data. Although big data is usually understood as “all data”, and even though relevant correlations are often drawn from unexpected data merged together, this is not an excuse for collecting every data point or maintaining records longer than required.
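As a minimal sketch of what minimization can mean in practice, consider keeping only the fields needed for a stated purpose and discarding records past a retention window. The field names and the one-year window here are made up purely for illustration:

```python
from datetime import date, timedelta

# Hypothetical minimization policy: retain only the fields needed for
# the stated purpose, and drop records older than the retention window.
NEEDED_FIELDS = {"order_id", "amount", "created"}
RETENTION = timedelta(days=365)

def minimize(records, today):
    """Apply the policy: expired records are dropped entirely, and
    surviving records are stripped of every unneeded field."""
    kept = []
    for rec in records:
        if today - rec["created"] > RETENTION:
            continue  # past retention: do not keep the record at all
        kept.append({k: v for k, v in rec.items() if k in NEEDED_FIELDS})
    return kept
```

Running every inbound record through a policy like this enforces minimization by construction, rather than relying on analysts to delete data after the fact.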
Furthermore, no matter how strong the data privacy protections are, people may not be open to sharing data, for different reasons: a lack of trust, for instance, or because they have something to hide. This may generate an adverse selection problem that is not always considered, because clients who do not want to share might be perceived as hiding something relevant (and think about the importance of this adverse selection problem when it comes to governmental or fiscal issues).
Private does not necessarily mean secret, and shared information can still remain confidential and have value for both parties — the clients and the companies.
Waiting for the next data event…many more conferences coming soon, so stay tuned!
Note: part of the images and materials are taken from my book “Big Data Analytics: A Management Perspective” (Springer, 2016).