Strata is one of the most anticipated events of the year for the data community, and for good reason. This was my first time attending the conference, and I was really impressed by the organization and the quality of the talks.
It is actually a mix between a technical conference and an industry event: you are equally likely to find yourself in a room watching a Spark tutorial or listening to tips on implementing data science projects within your organization.
I heard rumors that it is getting bigger year after year, even though fewer exhibitors came (at least in London) than in previous editions, but I can guarantee that the quality of the speakers and talks is really high (choosing among the 60 daily talks is extremely hard, believe me).
I could have written a much longer article based on everything I saw during the event, but as always I will try to summarize three main takeaways, elaborated from my personal point of view.
I. New data roles and skills are shaping the data landscape
It is clear that the data landscape is moving at breakneck speed, and this is reflected in the number of new roles that companies are looking for, and sometimes even creating.
One of these is definitely the Chief Data Officer (CDO). I recently covered this topic in another blog post, but at the conference it became clear how pressing the need is for someone who guarantees that everyone can access the right data in virtually no time.
In that sense, the CDO is the person in charge of ‘democratizing data’ within the company: the end-to-end owner of the data workflow, who oversees the entire data value chain.
If the CDO does the job properly, you will see two different outcomes: first of all, the board will stop asking for quality data and will have a clear picture of what every team is doing. Second, and most important, a good CDO aims to create an organization in which a CDO has no reason to exist.
It is counterintuitive, but basically a CDO will have done a great job when the company no longer needs a CDO, because every line of business will be responsible and accountable for its own data.
At a lower level, instead, we all noticed the overnight transition from data analyst to data scientist. It did not surprise many of us, and it is now increasingly fading away as industry players get a better grasp of new technologies and scientific advances and understand how to formulate, tackle, and eventually try to solve a problem. On the other hand, the democratization of learning resources has made data science concepts accessible to many more people, and this is pushing the sector even further forward. Nowadays, to be a top-notch data scientist it is not enough to know when to use random forests (structured data) or neural nets (unstructured data); you also have to add value to the data process and to the final user in other ways.
Data scientists are indeed evolving, and the inspiring talk by Anthony Goldbloom from Kaggle told us how. The best data scientists (i.e., the ones who have performed best in Kaggle competitions over the last few years) seem to have a few characteristics in common:
- They are really creative, especially when it comes to feature engineering;
- They know how to avoid overfitting (and understand its importance): it is their main priority when it comes to modeling (a minimal check is sketched right after this list);
- They use version control (at the end of the day they are scientists, and they manage their data like scientific logs and notebooks of their experiments).
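To make the overfitting point concrete, here is a minimal sketch of the train-versus-validation check it implies, assuming scikit-learn; the synthetic dataset and the model choice are my own illustrative assumptions, not something presented at the conference:

```python
# A minimal overfitting check: compare the score on seen data with a
# cross-validated score, and distrust any large gap between the two.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic tabular data standing in for a competition dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

train_acc = model.score(X, y)                       # evaluated on seen data
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # evaluated out of sample

print(f"train accuracy:     {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
# A training score far above the cross-validated score is the classic
# overfitting signal the top competitors watch for.
```

The specific model does not matter here; the habit does: always score against data the model has not seen.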
Finally, if there is one thing Strata made clear, it is that no two data scientists are alike, and that this diversity matters in a good way. I spent some time thinking about this in the past, and I reached the conclusion that the minimal data team should consist of at least four people, but I'd like to be challenged on that. My data dream team would be composed of:
- A (pure) data scientist: briefly speaking, the person who deals with modeling;
- A data engineer: the person who maintains the architecture and makes the data available to the data scientist, ready to be used straight away;
- A business intelligence analyst: the liaison between the data team and executives and other teams;
- A customer intelligence analyst: the person in charge of increasing customer satisfaction through data models and of communicating with the final users.
It does not matter how good your algorithms are or how many different silos of data you have on a single customer: the success of a data science project still depends heavily on the quality of the team working on it, so spend time on this topic internally.
II. The data science journey is an uphill path
A common question kept coming up over and over: is your company data-mature? Well, in many cases the answer is a big NO.
There are many ways to assess whether you are ready to embrace the change data science is bringing to your business, and many speakers presented their own path based on their own experience. I have also developed a basic framework and maturity model of my own, which I propose here as a simple alternative to the many others that different players in our industry use and sponsor.
It is a roadmap for implementing a revenue-generating, impactful data strategy: it can be used both to assess the current situation of the company and to identify the next steps to undertake to enhance internal big data capabilities.
The model is a four-by-four matrix in which the increasing stages of evolution (Primitive, Bespoke, Factory, and Scientific) are assessed along four dimensions: Culture, Data, Technology, and Talent. The final considerations are drawn in a last row concerning the financial impact on the business of a well-set data strategy.
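For reference, here is the skeleton of the matrix as just described; the cell contents are omitted here and are covered in the detailed explanation linked below:

|                  | Primitive | Bespoke | Factory | Scientific |
|------------------|-----------|---------|---------|------------|
| Culture          | …         | …       | …       | …          |
| Data             | …         | …       | …       | …          |
| Technology       | …         | …       | …       | …          |
| Talent           | …         | …       | …       | …          |
| Financial impact | …         | …       | …       | …          |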
If you want to learn more about the maturity model, check this detailed explanation.
Why is this important? For several reasons, of course, but I mainly see it as a way to understand how to succeed. Because let's be honest: many data science projects fail, and you often don't even know why. Depending on the stage you are currently in, you might end up making one or more of the following common mistakes:
- Lack of business objectives and correct problem framing;
- Lack of scalability, or a project that is not correctly sized;
- Absence of C-level or high management sponsorship;
- Excessive costs and time, especially when people with wrong skill sets are selected (which is more common than you think);
- Incorrect management of expectations and metrics;
- Internal barriers (e.g., data silos, poor inter-team communication, infrastructure problems, etc.);
- Treating the work as a one-time project rather than as a continuous learning process;
- Neglecting data governance, privacy, and protection.
However, if you don't want to fail that fast, you had better follow Kim Nilsson's (Pivigo) advice:
- Start small. Prove you deserve resources, attention and budget;
- Maintain Agility throughout the entire project;
- Select the skills you actually need. You don't need a machine learning superstar by default, but you do need a stellar team;
- Manage expectations correctly. This will determine whether your team becomes essential to the company or is discarded within a few months;
- Convince the skeptics. Data have value, so win over those who do not think so.
III. Open Source is king in data science
There was almost no talk at Strata that was not explaining, building new solutions with, or simply using some open-source software.
I have been thinking for some time about why big tech companies give their technology away for free, and these are my conclusions:
- Open source technology raises the bar of the current state of the art for potential competitors in the field;
- GitHub accounts and packages look better, and have a greater impact, than a well-written resume in this world;
- It creates a competitive advantage in data creation/collection, in attracting talent (thanks to stronger technical branding), and in building additional software/packages/products on top of the underlying technology;
- Data scientists and developers are first of all scientists with a sharing mindset, and part of the industry's power to attract and retain talent comes from improving on what academia offers (i.e., better datasets, more interesting problems, better compensation packages, intellectual freedom);
- It drives progress and innovation in foundational technologies;
- It increases the overall value, ease of integration, and reliability of internal closed-source systems;
- It lowers the barrier to adoption, and builds traction for products that would not gain it otherwise;
- It shortens the product cycle: from the moment a technical paper is published or a piece of software is released, it takes only weeks for augmentations of that product to appear;
- More importantly, it can generate a data network effect, i.e., a situation in which more (final or intermediate) users create more data by using the software, which in turn makes the algorithms smarter, the product better, and eventually attracts more users;
- Writing code that others can read and understand is what makes you a better developer and scientist. This is something you only know if you have done it.
Food for thought
I arbitrarily selected a few of the insights I got at the conference, but much more food for thought was provided by the speakers, as well as by casual conversations with other participants. Here follows a short list of questions you might want to stop and think about:
- Can robots be creative? Can they recognize beauty?
- What’s the solution to our broken data landscape? (spoiler: Metadata!)
- How important is the causal impact relationship in machine learning modeling, and how do we measure it?
- Does more data beat a better model? (spoiler again: a big YES!)
- Instead of using a single big deep learning network, can we use several networks?
- Can data be useful and perfectly anonymous at the same time?
- Are big data technologies ethically neutral?