My learning journey: AI & DS
Takeaways from the AI & Data Science Capital Markets conference (London, Mar. 2017)

Last week I was in London for the conference on AI & Data Science in Capital Markets. Let me start by saying that I quite liked the format: two days packed with short presentations (15 minutes each), each followed by a panel made up of the 4–5 people who had just presented. Repeated over eight hours, this model adds up to a lot of presentations per day.
The organization was a bit too formal for my taste, but given the audience (many bankers, regulators, funds, etc.) that was probably necessary.
That said, it was an incredibly informative event and I personally learnt a lot. As usual, I will try to summarize three main takeaways from the conference.
I. Open Source AI

The first thought is about open sourcing technologies. I have already written about this trend, which looks quite unusual at first glance if you think about it, but my thinking around open source was strongly stimulated by the talk given by Wes McKinney. For those who don't know him: he is definitely not a random guy, but THE open source guy (creator of pandas and author of Python for Data Analysis).
The open source model is quite hard to reconcile with the traditional SaaS model, especially in the financial sector. Nonetheless, we are seeing many firms provide cutting-edge technologies and algorithms for free. While in some cases there is a specific business motivation behind it (e.g., Google releasing TensorFlow to avoid a conflict of interest with its cloud offering), the decision to open source (part of) one's technology genuinely represents an emerging trend.
Tools are nowadays less relevant than people or data, and a sharing mindset is a key asset for organizations. Starting from this statement, we can split the considerations on open source into two clusters: business considerations and individual considerations.
From a business perspective, the basic idea is that it is really hard to keep pace with the current rate of technological development, and you don't want your technology to become obsolete in three months' time. It is better to give it away for free and set the benchmark than to keep it proprietary and discard it after a few months. Furthermore, open sourcing:
- Raises the bar of the current state of the art for potential competitors in the field;
- Creates a competitive advantage in data creation/collection, in attracting talent (thanks to stronger technical branding), and in building additional software/packages/products on top of that underlying technology;
- Drives progress and innovation in foundational technologies;
- Increases the overall value, interoperability and sustainability of internal closed source systems;
- Raises awareness of the problems faced at scale on real-world data;
- Lowers the barrier to adoption, and gains traction for products that would not get it otherwise;
- Shortens the product cycle, because from the moment a technical paper or a piece of software is released, it takes only weeks for extensions of that product to appear;
- More importantly, it can generate a data network effect, i.e., a situation in which more (final or intermediate) users create more data by using the software, which in turn makes the algorithms smarter, the product better, and eventually attracts even more users (a toy simulation of this loop follows the list).
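To make the data network effect in the last point a bit more concrete, here is a purely illustrative toy simulation of the loop. All the constants are my own assumptions, not real figures.

```python
# Toy simulation of a data network effect: more users generate more data,
# which improves product quality (with diminishing returns), which in turn
# attracts more users. All constants are illustrative assumptions.
import math

users, data = 1_000.0, 0.0

for month in range(1, 25):
    data += users * 50                                     # each user produces some data points
    quality = min(1.0, 0.1 + 0.08 * math.log10(1 + data))  # diminishing returns on accumulated data
    users *= 1 + 0.05 * quality                            # a better product grows the user base faster
    print(f"month {month:2d}: users={users:10.0f}  quality={quality:.2f}")
```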
From the developer's point of view, instead, there are a number of interesting considerations:
- In this world, GitHub accounts and packages look better and have a greater impact than a well-written resume;
- Data scientists and developers are first of all scientists with a sharing mindset, and part of the industry's power to attract and retain talent comes from augmenting what academia offers (i.e., better datasets, interesting problems, better compensation packages, intellectual freedom);
- Academia has been drained of talent that moved to industry, and the concept of 'academic publication review' has turned into 'peer review' by the crowd (crowd-reviewing). This in turn translates into i) better troubleshooting, and ii) a deeper understanding of a technology's potential and implications;
- Writing code that others can read and understand is what makes you a better developer and scientist. This is something you only know if you have actually done it;
- As a general rule of thumb, contributors from academia usually push innovation forward, while industry contributors prefer system stability. Releasing open source software forces you to think about who will use it, and to design the entire system to be more reliable and stable in the first place.
These are some of the reasons why this model works nowadays, even though some argue that incumbents are not really being maximally open (Bostrom, 2016) and only release technology that is already somewhat old to them.
My personal view is that companies get the best out of spreading their technologies around without paying any real cost or suffering any counter-effect: they still hold unique large datasets, platforms, and a huge investment capacity that allow only them to scale up.
Regardless of the real reasons behind this strategy, the effect of this business model on AI development is controversial. According to Bostrom (2016), in the short term greater openness could increase the diffusion of AI: software and knowledge are non-rival goods, so more people could use them, build on top of previous applications and technologies at low marginal cost, and fix bugs. There would also be strong brand implications for companies.
In the long term, though, we might observe less incentive to invest in research and development because of free riding; hence there should exist a way to earn monopoly rents from the ideas individuals generate. On the positive side, open research builds absorptive capacity (i.e., it is a means of building skills and keeping up with the state of the art); it might bring extra profit from owning complementary assets whose value is increased by new technologies or ideas; and, finally, it is fostered by individuals who want to demonstrate their skills, build their reputation, and eventually increase their market value.
II. The Power of the Crowd

Real general AI will likely be a collective intelligence. It is quite likely that a superintelligence will not be a single terminal able to make complex decisions, but rather a group intelligence. A swarm or collective intelligence (Rosenberg, 2015; 2016) can be defined as "a brain of brains". So far, we have simply asked individuals to provide inputs and then aggregated those inputs after the fact into a sort of "average sentiment" intelligence.
According to Rosenberg, existing methods for forming a human collective intelligence do not even allow users to influence each other, and when they do, they only allow that influence to happen asynchronously, which causes herding biases.
An AI, on the other hand, will be able to fill these connectivity gaps and create a unified collective intelligence, very similar to the ones other species have. A good inspirational example from the natural world is bees, whose decision-making process closely resembles the human neurological one: both use large populations of simple excitable units working in parallel to integrate noisy evidence, weigh alternatives, and finally reach a specific decision.
According to Rosenberg, this decision is achieved through a real-time closed-loop competition among sub-populations of distributed excitable units. Every sub-population supports a different choice, and consensus is reached not by majority or unanimity, as in the average-sentiment case, but as a "sufficient quorum of excitation" (Rosenberg, 2015). An inhibition mechanism against the alternatives proposed by other sub-populations prevents the system from settling on a sub-optimal decision.
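Since this description is essentially algorithmic, here is a minimal toy sketch of a quorum-based decision with mutual inhibition. To be clear, this is not Rosenberg's actual method: the dynamics, the quorum threshold and the inhibition constant are all my own simplifying assumptions.

```python
# Toy sketch of a quorum-based group decision with mutual inhibition.
# Each option is backed by a sub-population of excitable units that
# accumulates noisy evidence and is inhibited by competing options;
# the first option to exceed the quorum threshold wins.
import numpy as np

rng = np.random.default_rng(0)

options = ["A", "B", "C"]
true_quality = np.array([0.55, 0.50, 0.45])   # hidden evidence strength per option
excitation = np.zeros(len(options))
QUORUM, INHIBITION = 25.0, 0.02               # arbitrary illustrative constants

for step in range(1, 10_000):
    evidence = true_quality + rng.normal(0, 0.5, size=len(options))  # noisy support
    inhibition = INHIBITION * (excitation.sum() - excitation)        # pressure from rival sub-populations
    excitation = np.maximum(0.0, excitation + evidence - inhibition)
    if excitation.max() >= QUORUM:
        print(f"step {step}: chose option {options[int(excitation.argmax())]}")
        break
```

The winner-take-all behaviour comes from the asymmetry of the inhibition term: the sub-population that pulls slightly ahead receives less inhibition than its rivals, so small random leads compound until the quorum is reached.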
Why am I saying all this? Well, because even if the whole idea of swarm intelligence may look (and be) quite far off, we are already observing some form of crowd intelligence applied in the markets. Human collective intelligence is, I believe, the first step towards building an artificial swarm intelligence. These are a few of the pioneers in the space:
- Leigh Drogen (Estimize): the idea behind Estimize is extremely fascinating, because it provides earnings forecasts and economic estimates through the 'wisdom of the (financial) crowd'. There are other similar examples, e.g., Almanis, Ace Consensus, and Premise, to name only a few;
- Richard Craib (Numerai): Numerai is another crowdsourcing example, in which 'tournaments' are run and the data scientists who implement the best models and reach the best predictions win a sum of money from the company. They recently raised $6M in a Series A round and created their own cryptocurrency to change market participants' incentives and make the market itself more efficient;
- Thomas Wiecki (Quantopian): a concept similar to Numerai, with the difference that participants are asked to submit the complete algorithm, and the scientist who wrote it can even decide to license it and get paid on a performance basis (a toy sketch of how such a tournament could be scored follows below).
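To give a feel for how such crowdsourced tournaments work mechanically, here is a toy sketch of a scoring step: participants submit probability forecasts for a hidden binary target and are ranked by log-loss. The participant names and numbers are hypothetical, and this is not Numerai's or Quantopian's actual scoring logic.

```python
# Toy sketch of scoring a crowdsourced prediction tournament (illustrative only).
# Each participant submits probabilities for a hidden binary target; the lowest
# log-loss on the holdout wins.
import numpy as np

rng = np.random.default_rng(1)

y_true = rng.integers(0, 2, size=1_000)                      # hidden outcomes
submissions = {                                               # hypothetical participants
    "alice": np.clip(y_true * 0.7 + rng.normal(0.15, 0.1, y_true.size), 0.01, 0.99),
    "bob":   np.clip(rng.uniform(0, 1, y_true.size), 0.01, 0.99),   # random guesser
}

def log_loss(y, p):
    """Mean negative log-likelihood of binary outcomes y under probabilities p."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

ranking = sorted((log_loss(y_true, p), name) for name, p in submissions.items())
for score, name in ranking:
    print(f"{name}: log-loss = {score:.3f}")
print(f"winner: {ranking[0][1]}")
```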
III. No matter what, it is still all about data

We could keep talking about how good our neural networks are, about the incredible business motive behind our product, or even about the improvements in processing speed of the latest chip on the market. The reality is that, no matter what, it is still all about data.
Data is still the only thing that matters
Peter Hafez from RavenPack pointed out a very interesting concept (on which I reflected some time ago): we are "data hoarders", i.e., we accumulate data day after day, even when we don't need it, simply because we can do so effortlessly. In finance, this phenomenon is even more exacerbated because unexplored data might contain the next 'nugget of alpha'. Up to what point is it useful to store and retain data, then? (I am not talking here about what you should ethically store, but rather about what you want to store for business reasons.)
MY SUGGESTION: Do not store and process data just for the sake of having it, because, with the amount of data being generated daily, the noise increases faster than the signal (Silver, 2013).
Pareto's 80/20 rule applies: 80% of the phenomena can probably be explained by 20% of the data you have.
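A quick synthetic experiment illustrates the point about noise growing faster than the signal: with a fixed number of observations, adding ever more (mostly noise) features keeps improving the in-sample fit while the out-of-sample fit deteriorates. The data and constants below are entirely made up for illustration.

```python
# Synthetic illustration: only 5 features carry signal, the rest are pure noise.
# Using more features inflates in-sample R^2 but hurts out-of-sample R^2.
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, n_signal = 200, 200, 5

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

beta = rng.normal(size=n_signal)                      # true coefficients on the signal features
X_all = rng.normal(size=(n_train + n_test, 150))      # 5 signal + 145 noise columns
y_all = X_all[:, :n_signal] @ beta + rng.normal(scale=1.0, size=n_train + n_test)

X_tr, X_te = X_all[:n_train], X_all[n_train:]
y_tr, y_te = y_all[:n_train], y_all[n_train:]

for k in (5, 25, 75, 150):                            # number of features actually used
    coef, *_ = np.linalg.lstsq(X_tr[:, :k], y_tr, rcond=None)
    print(f"{k:3d} features: in-sample R2 = {r2(y_tr, X_tr[:, :k] @ coef):.2f}, "
          f"out-of-sample R2 = {r2(y_te, X_te[:, :k] @ coef):.2f}")
```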
The overwhelming amount of data has also driven a fundamental shift in investment philosophy over the last few years, characterized by a few points:
i) No assumption is made on functional forms of the models;
ii) Non-financial, non-traditional data are used in financial contexts to extract temporary alphas;
iii) Data lead the research efforts to the detriment of fundamental analysis;
iv) Data is the great barrier to entering (and competing in) the market;
v) Data exhaust is extremely relevant: data generated as a by-product of digital activities that are not the core of your business, not easily monetizable, and yet a potential source of alpha.
This is a snapshot of how things are today. I do see things changing, though: companies like Vicarious, Geometric Intelligence (acquired by Uber three months ago) or, more recently, Gamalon are working toward reducing the amount of data needed to train neural networks. The data requirement is today the major barrier to AI being widely adopted (as well as the major competitive advantage), and the use of probabilistic program induction (Lake et al., 2015) could solve this problem for AGI development. A less data-intensive algorithm might eventually use the concepts it has learned and assimilated in richer ways, whether for action, imagination, or exploration.
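As a very rough illustration of the 'less data-hungry' direction (and emphatically not Lake et al.'s probabilistic program induction), here is a toy nearest-prototype classifier that labels new points using only five examples per class on synthetic 2-D data:

```python
# Toy few-shot baseline: classify by distance to the per-class prototype
# (the mean of only five labelled examples per class). Synthetic 2-D data.
import numpy as np

rng = np.random.default_rng(7)

# Three classes, only 5 labelled examples each (the "few-shot" support set).
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.5]])
support = {c: centers[c] + rng.normal(0, 0.6, size=(5, 2)) for c in range(3)}

prototypes = np.stack([pts.mean(axis=0) for pts in support.values()])  # class means

def classify(x):
    """Assign x to the class whose prototype (mean of its few examples) is closest."""
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

# Query points drawn from each class; accuracy stays high despite the tiny training set.
queries = np.vstack([centers[c] + rng.normal(0, 0.6, size=(50, 2)) for c in range(3)])
labels = np.repeat(np.arange(3), 50)
accuracy = np.mean([classify(x) == y for x, y in zip(queries, labels)])
print(f"accuracy with 5 examples per class: {accuracy:.2%}")
```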
Waiting for the next AI event…many more conferences coming soon, so stay tuned!
References
Bostrom, N. (2016). “Strategic Implications of Openness in AI Development”. Working paper.
Lake, B. M., Salakhutdinov, R., Tenenbaum, J. B. (2015). “Human-level concept learning through probabilistic program induction”. Science, 350(6266): 1332–1338.
Rosenberg, L. B. (2015). “Human Swarms, a real-time method for collective intelligence”. In Proceedings of the European Conference on Artificial Life (pp. 658–659).
Rosenberg, L. B. (2016). “Artificial Swarm Intelligence, a Human-in-the-loop approach to A.I.” In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) (pp. 4381–4382).
Silver, N. (2013). The Signal and the Noise: The Art and Science of Prediction. Penguin.