This article explains a few common concepts and terms from the big data world for an audience that is less technical than an engineering one but is at least somewhat familiar with the big data space. The terms are in no particular order, and a bit of technical language is used. Reach out if something is not clear!
Relational database management system (RDBMS)
A system that stores structured data in a predetermined schema (tables). It scales vertically through large symmetric multiprocessing (SMP) servers, or horizontally through clustering software. These databases are usually easy to create, access, and extend. The standard language for relational database interoperability is the Structured Query Language (SQL).
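To make the idea of a fixed schema and a SQL query concrete, here is a minimal sketch using SQLite, a small relational database that ships with Python. The table name, column names, and sample rows are invented for illustration only.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is fixed up front: every row must fit these columns.
# (Table and column names are hypothetical.)
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)",
    [(1, "Alice", "IT"), (2, "Bob", "UK"), (3, "Carla", "IT")],
)

# A SQL query: count customers per country.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('IT', 2), ('UK', 1)]
```

The same SELECT statement would look almost identical on most other relational databases, which is what interoperability through SQL means in practice.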
Programming language
A formally constructed language designed to communicate instructions to a machine. The main ones for data science applications are Java, C, C++, C#, R, and Matlab. Scala is another language that is becoming extremely popular, and it is an example of a language that supports functional programming.
Hadoop
Open source software for analyzing huge amounts of data on a distributed system. Its primary storage system is the Hadoop Distributed File System (HDFS), which replicates the data and distributes the copies across different nodes. It is written in Java. It is a core technology of the big data revolution and stores data in its native raw format. It can be used for several purposes (Dull, 2014): as a simple data staging or landing platform that complements the existing enterprise data warehouse (EDW), acting as an enterprise data hub (EDH); or as a way to manage data (even small data), transform it into a specific format in HDFS, and send it back to the EDW, thereby lowering costs while increasing processing power. Furthermore, it can integrate external data sources, archive data (either on-premises or in the cloud), and reduce the burden on a standard EDW.
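The idea of duplicating data across nodes can be sketched in a few lines. This is a deliberately simplified illustration, not how HDFS actually chooses nodes (the real placement policy also considers racks and free space). The function names, block names, and node names are hypothetical, and the replication factor of 3 mirrors the usual HDFS default.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified sketch: real HDFS placement is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [
            nodes[(i + k) % len(nodes)] for k in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one copy on a healthy node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(["block-0", "block-1"], nodes)
print(placement["block-0"])  # ['node-a', 'node-b', 'node-c']
print(readable_after_failure(placement, "node-b"))  # True
```

Because every block lives on several nodes, losing one machine does not make any data unreadable.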
MapReduce
Software for processing huge amounts of data in parallel.
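One widely used model for this kind of parallel processing is MapReduce: a "map" step turns each input into key-value pairs, the pairs are grouped by key, and a "reduce" step combines each group. The sketch below runs on a single machine and only illustrates the pattern; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key, here by summing the counts."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data is everywhere"]
# In a real cluster, each document could be mapped on a different node.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts["big"], word_counts["everywhere"])  # 2 1
```

The map step can run independently on each piece of input, which is what makes the work easy to spread across many machines.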
Flume
A service to gather, aggregate, and move chunks of data from several sources to a centralized system.
Cassandra
An open source database system for analyzing large amounts of data on a distributed system. It is characterized by high performance and high availability, with no single point of failure (i.e., a part of the system that, if it fails, stops the whole system). It encourages data denormalization, which means grouping data or adding redundant information in order to optimize database performance.
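Denormalization is easier to see with a tiny example. The sketch below uses plain Python dictionaries rather than any database API, and all names and records are invented: the normalized form keeps user details in one place, while the denormalized form copies them into every order so that a single lookup answers the question.

```python
# Normalized: user details are stored once; orders refer to them by id.
users = {1: {"name": "Alice", "city": "Rome"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 1, "total": 40.0},
]


def denormalize(orders, users):
    """Copy the user's name and city into each order (redundant on purpose)."""
    return [
        {
            **order,
            "user_name": users[order["user_id"]]["name"],
            "user_city": users[order["user_id"]]["city"],
        }
        for order in orders
    ]


orders_with_user = denormalize(orders, users)
# Each row now answers "who ordered this, and from where?" without a join.
print(orders_with_user[0]["user_name"], orders_with_user[1]["user_city"])  # Alice Rome
```

The trade-off is that the same information is stored several times, so reads become faster while updates have to touch more rows.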
Distributed system
Multiple terminals (nodes) communicating with one another. A problem is divided into many tasks, which are assigned to the terminals. The system is highly scalable because further nodes can be added.
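The divide-and-assign idea can be sketched on a single machine, with threads standing in for separate terminals. This is only an analogy under that assumption: a real distributed system also has to deal with networking and failures. The helper name and the choice of four workers are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(numbers, num_workers):
    """Cut the input into roughly equal chunks, one per worker."""
    size = -(-len(numbers) // num_workers)  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


numbers = list(range(1, 101))
tasks = split_into_tasks(numbers, 4)

# Each worker (a thread here, a node in a real cluster) sums its own chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, tasks))

# The partial results are combined into the final answer.
total = sum(partial_sums)
print(partial_sums, total)  # [325, 950, 1575, 2200] 5050
```

Adding more workers means smaller chunks per worker, which is the sense in which such a system scales as nodes are added.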
Google File System
A proprietary distributed file system for managing large datasets efficiently.
HBase
An open source non-relational (column-oriented) database built on top of HDFS. It is very useful for real-time random read and write access to data, as well as for storing sparse data (small, specific chunks of data within a vast amount of it). It is modeled on Google's Bigtable, which is also a non-relational store.
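Sparse, column-oriented storage can be pictured as a table in which each row keeps only the columns it actually has. The sketch below is a toy model built from Python dictionaries, not the API of any real database; the `put` and `get` helpers, row keys, and column names are hypothetical.

```python
# Toy model: row key -> {column name -> value}. Missing columns cost nothing.
table = {}


def put(row_key, column, value):
    """Random write: set one cell without touching the rest of the row."""
    table.setdefault(row_key, {})[column] = value


def get(row_key, column):
    """Random read: fetch one cell, or None if the cell was never written."""
    return table.get(row_key, {}).get(column)


put("user-1", "info:email", "alice@example.com")
put("user-1", "info:city", "Rome")
put("user-2", "info:email", "bob@example.com")  # user-2 has no city at all

print(get("user-1", "info:city"))  # Rome
print(get("user-2", "info:city"))  # None
```

In a relational table, the missing city for `user-2` would still occupy an empty cell; here it simply does not exist.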
Enterprise Data Warehouse (EDW)
A system used for analysis and reporting that consists of central repositories of integrated data from a wide spectrum of different sources. Data typically reaches an EDW through extract-transform-load (ETL), the most representative case of bulk data movement. Three other important examples of related systems are data marts (i.e., a subset of the EDW extracted in order to address a specific question), online analytical processing (OLAP), used for multidimensional, low-frequency analytical queries, and online transaction processing (OLTP), used instead for high-volume, fast transactional data processing. The wider system that also includes a set of servers, storage, operating systems, databases, business intelligence, data mining, etc., is called a data warehouse appliance (DWA).
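The three ETL steps can be illustrated with a very small sketch. It assumes a CSV export as the source and uses an in-memory SQLite database as a stand-in for the warehouse; the column names, sample data, and function names are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system, with messy formatting.
raw_csv = "order_id,amount,currency\n1, 10.50 ,eur\n2,7.00,EUR\n"


def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix types, strip whitespace, normalize currency codes."""
    return [
        (int(r["order_id"]), float(r["amount"].strip()), r["currency"].strip().upper())
        for r in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW
load(warehouse, transform(extract(raw_csv)))

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 17.5
```

Once the data is loaded in a clean, consistent form, reporting queries such as the final `SUM` become straightforward.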
Resilient Distributed Datasets (RDD)
A logical collection of data partitioned across machines. The best-known implementation is in Spark, an open source cluster computing framework designed to accelerate analytics on Hadoop thanks to its multi-stage in-memory primitives (that is, basic data types defined in programming languages or built in with their support). It is reported to run up to 100 times faster than Hadoop MapReduce on some workloads, but its disadvantage is that it does not provide its own distributed storage system.
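The notion of a collection partitioned across machines can be imitated with plain Python lists. This sketch does not use Spark and does not reproduce its API; it only shows the idea that each partition is processed independently and the results are combined at the end. The helper names are hypothetical.

```python
def partition(data, num_partitions):
    """Spread the items over partitions (each would live on one machine)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, fn):
    """Apply fn to every item, one partition at a time."""
    return [[fn(x) for x in part] for part in partitions]


data = list(range(1, 11))
parts = partition(data, 3)
print(parts)  # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]

squared = map_partitions(parts, lambda x: x * x)

# Each partition produces a partial sum; only those small results are combined.
total = sum(sum(part) for part in squared)
print(total)  # 385
```

Keeping the intermediate partitions in memory between steps, instead of writing them to disk, is the main reason in-memory processing can be much faster.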
Hive
An additional example of EDW infrastructure, which facilitates data summarization, ad hoc queries, and specific analyses.
Pig
A platform for processing huge amounts of data through a native programming language called Pig Latin. Its scripts run as sequences of MapReduce jobs.
Data mart
A subset of the data warehouse used for a specific purpose. Data marts are therefore department-specific or related to a single line of business (LoB). The next level up is the virtual data mart, i.e., a virtual layer that creates various views of data slices: instead of physically creating a data mart, it simply exposes a view of the data. The final evolution is the data lake, a massive repository of unstructured data with very large computational capability. Hence, data marts physically create repositories (slices) of data; virtual data marts leave the data where they are and create virtual constructs, reducing the cost of transferring and replicating them; and data lakes work like virtual data marts but with any kind of data format.
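The difference between physically creating a slice and leaving the data where it is can be shown with SQLite: a copied table plays the role of a physical data mart, and a view plays the role of a virtual one. This is a small-scale analogy only, and the table names and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "marketing", 100.0), ("EU", "finance", 250.0), ("US", "marketing", 80.0)],
)

# Physical data mart: the marketing slice is copied into its own table.
conn.execute(
    "CREATE TABLE marketing_mart AS SELECT * FROM sales WHERE department = 'marketing'"
)

# Virtual data mart: a view that reads from the original table on demand.
conn.execute(
    "CREATE VIEW marketing_view AS SELECT * FROM sales WHERE department = 'marketing'"
)

# New data arrives in the warehouse after both were created.
conn.execute("INSERT INTO sales VALUES ('APAC', 'marketing', 60.0)")

physical_rows = conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]
virtual_rows = conn.execute("SELECT COUNT(*) FROM marketing_view").fetchone()[0]
print(physical_rows, virtual_rows)  # 2 3
```

The copied table has to be refreshed to see the new row, while the view sees it immediately and stores nothing of its own.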
Dull, T. (2014). A Non-Geek’s Big Data Playbook. SAS Best Practices white paper. Retrieved from http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/non-geeks-big-data-playbook-106947.pdf
Note: the above is an adapted excerpt from my book “Big Data Analytics: A Management Perspective” (Springer, 2016).