KubeCon co-chair and Senior Software Engineer at Google Janet Kuo was quoted as saying that “it’s all of the solutions and extensions that expand from Kubernetes that will dramatically change the world as we know it.” Kubernetes offers a number of extension points, like CRDs that extend the Kubernetes API to cover custom use cases, as well as CNI, CSI, and security extensions, all of which allow Kubernetes to manage everything from data and storage to networking and infrastructure. Infrastructure management in particular is an area where Kubernetes has permanently changed the way things are done, moving away from the traditional ticket-raising system to a self-service model built on infrastructure as code.
From news to hospitality, e-commerce, dating, banking, music, and more, Kubernetes is quickly becoming the de facto standard for almost any software requirement, and big data is no exception. While big data was nearly synonymous with Hadoop until not that long ago, the advantages of containers, microservices, and cloud-based Kubernetes environments are too significant to ignore. Unlike Hadoop, Kubernetes was built with modern cloud-based environments in mind and comes with an ever-expanding ecosystem of related tools and services. Kubernetes also supports hybrid environments, allowing users to build clusters that span multiple cloud, edge, and on-premises locations.
Hadoop and the big data revolution
Let’s discuss Hadoop and the way things were done pre-Kubernetes. Hadoop is an open-source framework that’s incredibly reliable and efficient at dealing with massive datasets. It was also quite economical up until a few years ago as Hadoop didn’t require expensive hardware and allowed users to distribute their workload across hundreds of commodity servers fitted with as many cheap disks as they could hold. This ability to build a high-performance, fault-tolerant, scalable cluster from cheap servers was unheard of at the time and revolutionized the world of big data.
Hadoop achieved this through distributed parallel processing: data is split into blocks, distributed across multiple nodes, and processed in parallel. This way, each node processes the data stored on it, rather than wasting time moving it all across the network.
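The idea of splitting data into blocks and processing each block where it lives can be sketched in a few lines of Python. This is a single-machine toy, not Hadoop itself: `multiprocessing` workers stand in for cluster nodes, and each worker computes a partial result over its own block only.

```python
from multiprocessing import Pool


def split_into_blocks(records, block_size):
    # Split the dataset into fixed-size blocks, like HDFS splits files.
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def process_block(block):
    # Each "node" computes a partial result over its local block of records
    # (here: total number of characters), with no cross-node data movement.
    return sum(len(record) for record in block)


if __name__ == "__main__":
    records = [f"event-{i}" for i in range(1000)]
    blocks = split_into_blocks(records, block_size=250)
    with Pool(processes=4) as pool:
        partials = pool.map(process_block, blocks)  # blocks processed in parallel
    total = sum(partials)  # only the small partial results are combined centrally
    print(total)
```

The key property mirrored here is that only the small partial results travel between workers and the coordinator, never the raw data blocks themselves.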
Hadoop has four modules powering its functionality:
- HDFS, the Hadoop Distributed File System, which orchestrates storage
- YARN, which schedules jobs and manages tasks and cluster resources
- MapReduce, a big data processing engine that supports parallel programming and helps extract meaning from unstructured data; Hadoop also supports alternative processing engines such as Spark, Hive, Pig, and Tez
- Hadoop Common, a set of shared libraries
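MapReduce's three phases (map, shuffle, reduce) can be illustrated with the canonical word-count example. The sketch below is plain, single-machine Python rather than Hadoop's actual Java API, but the data flow is the same:

```python
from collections import defaultdict


def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1


def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Reduce: combine each key's values into a final count.
    return {word: sum(counts) for word, counts in grouped.items()}


docs = ["big data on kubernetes", "big data on hadoop"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'on': 2, 'kubernetes': 1, 'hadoop': 1}
```

In real Hadoop, the map and reduce phases run on many nodes at once and the shuffle moves grouped keys across the network, which is where HDFS and YARN come in.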
Containers and modern applications
While Hadoop is still a popular solution for big data analytics, it was built in the early 2000s, with an initial release in 2006. This was a time when databases, data warehouses, and storage systems were expensive and hard to scale, and network latency was their biggest challenge. Organizations therefore preferred to store their data on-premises to avoid the hassle of moving such large datasets around. Times have changed: containers are now the standard unit of software deployment, and running big data workloads in the cloud has become commonplace.
Without containers, managing dependencies and updates can be a very labor-intensive process. Hadoop has been patched and extended with a number of different Apache projects to deal with these fundamental changes in computing. YARN, initially developed to run isolated Java processes for big data workloads, was later modified to support Docker containers. But while modern analytical software is largely Python-based and runs on microservice architectures, YARN users still need to deal with Java and HDFS, which is quite restricting. In contrast, Kubernetes was built from a clean slate with modern, containerized software in mind, and Docker in particular.
Spark and real-time analytics
Modern applications are constantly producing data at an unprecedented rate. Every action by a user creates data in the form of computer-generated records called event logs. Real-time big data analytics involves analyzing event logs virtually as soon as they are created. Organizations depend on this kind of analysis for time-sensitive metrics, such as customer behavior or sentiment, to provide unique and enhanced end-user experiences. While Hadoop is known to struggle with real-time data, Spark excels at both batch and near-real-time stream processing, while also powering AI and ML applications that depend on big data.
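The difference between batch and stream processing can be made concrete with a small sketch. Instead of re-scanning a stored dataset, a stream processor updates its metrics incrementally as each event log arrives. The event fields below (`user`, `action`) are hypothetical, chosen only for illustration:

```python
from collections import Counter


def stream_metrics(events):
    # Consume events one at a time and maintain a running count per action,
    # yielding an up-to-date snapshot after each event (stream processing),
    # rather than computing counts once over the full dataset (batch).
    counts = Counter()
    for event in events:
        counts[event["action"]] += 1
        yield dict(counts)


events = [
    {"user": "u1", "action": "click"},
    {"user": "u2", "action": "view"},
    {"user": "u1", "action": "click"},
]
snapshots = list(stream_metrics(events))
print(snapshots[-1])  # {'click': 2, 'view': 1}
```

A real pipeline would read events from a message bus and compute windowed aggregates, but the principle is the same: the metric is always current because it is updated per event, not per batch job.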
There are a number of advantages to running Spark on Kubernetes rather than in a Hadoop YARN-based environment. Beyond packaging all dependencies into containers, which avoids the dependency conflicts common with Spark deployments, Kubernetes affords far more control over how applications consume resources. Kubernetes also has a healthy ecosystem of tools like Helm that help automate setup and configuration, as well as tools like Prometheus and Grafana that aid in real-time monitoring.
The future of big data analytics
As we move forward into new technological territories like edge computing, AI, NLP, VR, and more, datasets are only going to get bigger, and making sense of them will be more critical than ever before. AI in particular requires massive datasets to train ML models, and as the technology matures and improves, it's going to get hungrier. Size is relative, however: what we considered a large file ten years ago could probably fit on a pen drive today. As organizations grow and get used to analyzing big data for insights on a regular basis, there might even come a day when we drop the “big” and go back to just calling it data.
If you have questions related to this topic, feel free to book a meeting with one of our solutions experts or email us at email@example.com.