Data Engineering with Kubernetes - The Best Choice for Big Data Applications

Learn about management of containerized Big Data applications via Kubernetes with focus on its architecture, features, and deployment tips.

Challenges of Big Data Applications

In order to stay competitive, companies must constantly improve their products or services, implement innovative solutions, analyze user experiences, find target audiences for advertising, etc. These activities, especially for large companies or corporations, require working with and processing large amounts of data.

Big Data applications are among the best solutions for statistical analysis, predictive modeling, and other data science projects, and are designed to cope with vast volumes of data with various structures at high velocities. Big Data has become a standard utility for processing terabytes of information on clusters by distributing tasks among dozens of compute nodes. Horizontal scaling seems like a good enough solution for heavy workloads, but what challenges do Big Data applications face today?

We will skip common problems, such as designing software architecture in the Big Data paradigm, which only skilled and experienced engineers can resolve. One of the widespread challenges is dynamically scaling computing capacity to match current needs by adding servers to the cluster or removing them from it. Since setting up an environment for Big Data applications requires a time-consuming and complex installation procedure for special software, adding or removing servers quickly and automatically is not an option.

Also, such environments are usually used only for Hadoop-based apps, not for web servers or other production workloads. So companies must maintain both a production environment and a Big Data ecosystem, leading to high costs. It is also difficult to predict how much computing power the data engineering tasks will require, so capacity may have to be increased or decreased after some time.

So we can see that Big Data applications in their original form are not flexible, are challenging and time consuming to install and set up, and are expensive to maintain, but today new solutions have been developed that alleviate some of these downsides.

What is Kubernetes

Before we look at Data Engineering tools for Kubernetes, let’s very quickly understand what Kubernetes is. Kubernetes (also known as “K8s”) is an open-source platform, originally developed by Google and now maintained under the Cloud Native Computing Foundation, for orchestrating containerized applications that can be hosted on hundreds or more servers. It is widely regarded as the most widespread solution for scaling high-load systems.

Containerized apps are processes that run in an isolated environment called a container. Such containers contain all needed dependencies, like libraries, frameworks, other software, and files. That’s why they can be quickly launched on any computer with an appropriate runtime engine. Kubernetes allows us to easily manage many containerized apps, their behavior, and their interconnections through its API at a high level of abstraction.

The simplest way to deploy a containerized app is a Kubernetes Pod, which contains definitions of one or more containers, their arguments, environment variables, and other configuration. With a Kubernetes Deployment we can define a template for Pods, manage the number of replicas, and control the current state of the application by rolling out a new version or rolling back to a previous one. Besides the API there is the Kubernetes Dashboard, a web panel that also helps control the cluster.
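To make the relation between a Deployment and its Pod template more concrete, here is a minimal sketch that builds a Deployment manifest as a plain Python dictionary and prints it as JSON. The application name, image reference, and environment variable are hypothetical placeholders, and the helper function is only an illustration, not part of any Kubernetes client library.

```python
import json


def make_deployment(name: str, image: str, replicas: int, env: dict[str, str]) -> dict:
    """Build a Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        # Environment variables are listed as name/value pairs.
        "env": [{"name": key, "value": value} for key, value in env.items()],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels of the Pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical application values, used only for illustration.
manifest = make_deployment(
    name="etl-worker",
    image="registry.example.com/etl-worker:1.4.2",
    replicas=3,
    env={"LOG_LEVEL": "info"},
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be saved to a file and applied with `kubectl apply -f`. Changing `replicas` or the image tag and applying again is what triggers scaling or a rollout of a new version.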

Kubernetes architecture is generally split into two parts: the control plane and worker nodes (compute machines). The control plane, often called the master, consists of various components that manage worker nodes and Pods in the cluster, and it can run on one or several servers. It includes a kube-apiserver, an etcd storage, a kube-controller-manager, a cloud-controller-manager, and a kube-scheduler; a DNS server for Kubernetes Services usually runs alongside them as a cluster add-on. These components are responsible for managing resources, scheduling workloads on appropriate worker nodes, saving the state of all objects in the cluster, and processing API requests. Worker nodes consist of a kubelet, a kube-proxy, and a container runtime. These services run workloads, report on their health, and route network traffic for Kubernetes Services. Kubernetes can use different container runtimes, which are responsible for running containers; containerd has become the more popular choice and has displaced Docker Engine, whose built-in integration (dockershim) was removed in Kubernetes 1.24.

OpenShift is an enterprise Kubernetes container platform developed by Red Hat. It may look the same as K8s, so what is the difference between OpenShift and Kubernetes? OpenShift contains components of open-source Kubernetes and adds some features for security and productivity. One of the key differences is that OpenShift runs only on Red Hat’s own distributions (such as Red Hat Enterprise Linux CoreOS), while Kubernetes is supported on any Linux OS. Also, OpenShift provides commercial support, a better web GUI for managing resources in the cluster, and stricter security policies. But Red Hat’s support costs money, while Kubernetes can be installed for free and has a big community that can help with most problems.

So what is Kubernetes used for? It is used for managing containerized applications because it facilitates declarative configuration and automation. With the help of Kubernetes any software solution can be easily deployed both for testing and production purposes, even when a project consists of multiple services.

Kubernetes vs Docker

Docker is a toolset for building and running containers, while Kubernetes is a platform for managing running containers and their configuration. So Docker and Kubernetes aren’t competitors; they are two technologies with their own tasks that complement each other. Docker provides tools for building container images, testing and running them, and publishing them to container registries. So with the help of Docker, we can containerize our applications and then deploy them on a Kubernetes cluster. There is also Docker Swarm, which can manage multiple running containers on several servers. Still, Kubernetes is a more powerful tool for that, because Docker Swarm provides just a small set of basic operations on running containers and their configuration.

Data Engineering and Data Science

Data Engineering is focused on preparing data from its raw form: extracting, collecting, and combining data from different sources. Data engineers pass clean data, free of human or machine errors, to data scientists for analyzing and solving business problems. Data Engineering is also about developing and maintaining data pipelines, which can be used by data scientists or in production. So the main idea of data engineering is to collect, preprocess, move, and store data of different forms. All these processes together can be called a data pipeline. Tools for designing, building, and arranging data pipelines include Hive, Spark, Sqoop, Cassandra, PostgreSQL, Airflow, YARN, etc. A data engineer is responsible for the accuracy of the data and for running data pipelines effectively to satisfy business requirements.

Data Science supports decision-making processes, answers questions, and provides metrics to solve business problems. Data scientists work with data provided by data engineers, analyze it with different tools and strategies, and give the business insights based on the results. They can use various methods, like machine learning, statistical models, artificial intelligence, etc. By finding hidden patterns in vast amounts of data, data scientists can predict trends and inform decisions about new features and improvements for the company. Data scientists often use programming languages like Python, R, Julia, etc., and frameworks like TensorFlow and PyTorch for experiments. Both Data Engineering and Data Science are valuable. Still, Data Science couldn’t exist without Data Engineering, so powerful and efficient solutions for the former can really speed up obtaining the desired results.

Solutions for Data Engineering in Kubernetes

It seems that Kubernetes is a suitable environment for a wide range of tasks, but what Data Engineering solutions are present in it? One of the most popular applications in Big Data platforms is Spark, and the open-source community is also developing a dedicated integration for Kubernetes. Spark Operator, which represents this distributed data processing engine in the Kubernetes ecosystem, doesn’t need a separate resource manager like YARN or a coordination service like ZooKeeper, because Kubernetes takes over those roles.

Also, Kubernetes provides the Container Storage Interface (CSI), through which drivers for prevalent storage systems for distributed apps, such as NFS, S3-compatible object storage, or (via third-party drivers) HDFS, can be plugged in. In recent years machine learning (ML) and other artificial intelligence (AI) projects have rapidly gained popularity, and Kubernetes has an excellent ML workflow platform called Kubeflow. In Kubeflow, we can launch pipelines for training models with frameworks such as TensorFlow and PyTorch, and integrate with serving tools such as Seldon.

This framework can also deploy trained models in Kubernetes and serve them for production purposes, including replicating workloads and load-balancing requests between the replicas. JupyterHub notebooks provide an interactive sandbox with Python, R, and other programming languages for engineers to perform data science tasks, run experiments, and test hypotheses.

Kubernetes is a good orchestrator for microservices: it encloses dedicated algorithms under the hood and executes them automatically. However, we may need to schedule custom tasks or even complicated scenarios that use different apps. In such a case, we can use Airflow, a platform for launching DAGs (directed acyclic graphs of tasks) on a given schedule. DAGs are custom programmable workflows where, for example, we can connect to an FTP server, download some data, process it with Python code, and save the results in a database. Airflow has many provider packages, so it’s easy to integrate with favored solutions. In Kubernetes, we can also deploy different databases, whether relational or NoSQL. To sum up, Kubernetes has a wide range of solutions for Data Engineering, MLOps, etc.
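The FTP workflow mentioned above can be sketched as a small dependency graph. The example below deliberately uses only Python’s standard library (`graphlib`, available since Python 3.9) rather than the Airflow API, so it shows the DAG idea itself and not how Airflow is actually programmed; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical and mirror the FTP example from the text.
dependencies = {
    "download_from_ftp": set(),
    "process_with_python": {"download_from_ftp"},
    "save_to_database": {"process_with_python"},
}


def run_pipeline(deps: dict[str, set[str]]) -> list[str]:
    """Visit the tasks in dependency order and return the execution order."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real scheduler would run the task's code at this point.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

A scheduler such as Airflow adds what this sketch leaves out: timed triggering, retries, parallel execution of independent tasks, and integrations with external systems.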

Kubernetes Features, Pros, And Cons

Among the many features of Kubernetes, we can point out the following valuable ones:

  • Automated horizontal scaling of workloads: the HorizontalPodAutoscaler adds Pod replicas when an observed metric, such as average CPU utilization, rises above the configured target, and removes them again when traffic is low, which saves computing resources.
  • There are many high-level objects for easy development of microservice architecture, such as Secrets, Services, ConfigMaps, access rights, containers, data stores, namespaces, etc.
  • Interaction with the Kubernetes cluster can be done through console, web app, and REST API interfaces.
  • Resources can be described declaratively in YAML files with concise and readable syntax.
  • We can set clear limits on the consumption of CPU and RAM resources for any workload.
  • The scheduler spreads workloads across the connected nodes, and network connections to a Kubernetes Service are load-balanced between the Pods behind it.
  • Role-based access control to the cluster: only authorized users with the proper rights can perform the permitted actions.
  • We can schedule jobs with CronJob, which is meant to perform repeated actions such as backups, report generation, etc.
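As an illustration of the horizontal scaling feature from the list above, the following sketch applies the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. It is a simplified model that ignores the tolerance band and stabilization windows of the real controller, and the replica bounds are assumed example values.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Estimate the replica count a HorizontalPodAutoscaler would aim for."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU utilization with a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 replicas at 15% average CPU utilization with a 60% target: scale in.
print(desired_replicas(4, 15, 60))
```

With these inputs the function returns 6 and 1 respectively, which shows both directions of scaling: more replicas under load and fewer when traffic drops.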

A disadvantage of Kubernetes is that installing and configuring the platform for production purposes is complex. It also requires skilled DevOps engineers and developers to maintain clusters. Kubernetes is overkill for simple applications like static websites. But if we have a complex system that performs massive computations, switching to Kubernetes and hiring appropriate specialists makes sense.

Tips For Properly Deploying Apps In Kubernetes

First of all, we need to containerize the application that is going to be deployed in Kubernetes. To do so, create a Dockerfile in which we copy and install our program and all required dependencies. It is considered good practice to create a separate Docker image for each process.

Another helpful tip: use tiny base images for containers to reduce image size and speed up pulls and startup. Also, don’t forget to configure readiness and liveness probes in the manifests for apps: they help monitor the program’s status and restart the container when it has crashed or hung. When releasing manifests, check that the image tags are not “latest”: a mutable tag makes it unclear which version is actually running, and re-applying a manifest whose tag has not changed does not trigger a rollout of the new version.

We should store all credentials only in Kubernetes Secrets and expose them to containers as environment variables or mounted files. Configuring proper RBAC for Secrets, workloads, and other K8s resources is essential, because ServiceAccounts, Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings are what restrict access to resources for users and apps inside the cluster at the Kubernetes API level.

If there are open ports in the application, create a Kubernetes Service and use its name (plus its namespace when calling from a different namespace) as the DNS name in other apps. If the deliverable program stores persistent data, use a Kubernetes StatefulSet; otherwise, use a Deployment. A Kubernetes Job is suitable for a process that should run only once. So Kubernetes is a flexible framework for deploying any workload.
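Some of these tips can be checked automatically before a release. The sketch below is a hypothetical helper, not an existing tool: it inspects the Pod template of a manifest (given as a Python dictionary) for missing probes and “latest” image tags, and builds the in-cluster DNS name of a Service. The default cluster domain `cluster.local` and all names in the example are assumptions.

```python
def service_dns_name(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Service (cluster domain is an assumed default)."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def image_version(image: str) -> str:
    """Return the tag or digest of an image reference, or 'latest' if none is given."""
    name = image.rsplit("/", 1)[-1]  # drop the registry part, which may contain a port
    if "@" in name:  # pinned by digest, e.g. app@sha256:...
        return name.split("@", 1)[1]
    if ":" in name:
        return name.split(":", 1)[1]
    return "latest"


def lint_pod_spec(pod_spec: dict) -> list[str]:
    """Collect violations of the deployment tips for every container in a Pod spec."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        if image_version(container.get("image", "")) == "latest":
            problems.append(f"{name}: image tag is 'latest' or missing")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                problems.append(f"{name}: no {probe} configured")
    return problems


# Hypothetical Pod spec: the first container follows the tips, the second does not.
pod_spec = {
    "containers": [
        {
            "name": "api",
            "image": "registry.example.com:5000/api:2.3.1",
            "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
            "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
        },
        {"name": "worker", "image": "registry.example.com:5000/worker"},
    ]
}
problems = lint_pod_spec(pod_spec)
for problem in problems:
    print(problem)
print(service_dns_name("postgres", "data"))
```

Here only the `worker` container is reported, for its missing tag and probes. A check like this could run in a CI pipeline before manifests are applied.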

People Also Ask

How is Kubernetes used in Big Data?

The Kubernetes ecosystem is growing, and more solutions are being developed for it, including Big Data ones. Kubernetes can replace schedulers and resource managers like Apache Hadoop YARN. Moreover, the central paradigm of K8s is horizontal scaling, which suits workloads that process vast amounts of data.

We can find many different software solutions for Kubernetes on the Cloud Native Computing Foundation (CNCF) website: databases, streaming and messaging, service mesh and proxies, CI/CD, security, automation, and others. Among them is the Kubernetes Spark Operator, which can launch Spark applications, and data processed in these apps can be stored on mounted HDFS storage. Other Big Data apps, like Hive and Kafka, have also been adapted for K8s.

What are the alternatives to Kubernetes?

Before the advent of Kubernetes, virtual machines were used as an environment for distributed apps on several hosts. VMs carry overhead, because each of them requires significant RAM and CPU resources. Although virtualization and containerization technologies address different challenges, we can say that VMs are not as flexible as a Kubernetes cluster.

Moving virtual machines between different clouds and traditional data centers can be challenging, while all the needed infrastructure on a K8s cluster can be deployed anywhere from several plain text files. Maintenance is also simpler thanks to the standardized Kubernetes API and declarative resource management.

There are other analogs in the containerized world, like Docker Swarm, Apache Mesos, Nomad, Rancher, and Amazon ECS. Among them, Kubernetes is the most popular open-source solution, with many features and a giant community.

Why does migration to Kubernetes matter?

Monolithic applications are increasingly being superseded by microservice solutions. For big and complicated products, it is easier to manage separate services: each service is responsible for one concrete task, so each is smaller than a single big app. Also, each service can use different technologies and frameworks based on the business requirements.

Even when one service becomes unavailable, the others can still work normally, so there may be only a partial outage instead of the full outage a monolithic app would suffer. Monolithic architecture, by contrast, is the classic way of developing software solutions.

A monolithic app is a single unit serving all functions in one place. Usually, monolithic applications have one large code base, and when developers want to update something, they make changes in the whole stack at once.

Microservices are a collection of small services which are interconnected to form an entire application. Microservices typically use lightweight communication mechanisms, like HTTP REST APIs. Since microservices are built independently, each service can use its own stack of technologies.

For example, a service for which the speed of processing data is important can be written in C or Go, even when other services use Java. That’s why such a distributed architecture can improve the performance and maintainability of a program, although switching to a new architecture is always a significant step for any project.

Still, migrating to the Kubernetes platform makes sense and pays back the invested effort in the long run, because K8s means less effort spent on supporting the underlying stack. This is achieved by:

  • dynamic scaling out of the box, which can reduce costs;
  • a declarative style of describing infrastructure in YAML files;
  • deploying clusters on-premises as well as in the cloud, so it’s cloud-agnostic;
  • monitoring tools and configurable health checks, which help keep apps healthy;
  • deployment patterns for workloads (rolling updates are built in, while canary and blue-green releases can be arranged with additional configuration or tools);
  • timely cluster upgrades to newer Kubernetes versions, which keep the environment up to date and prevent many security issues.

For high-load solutions it’s critical to prevent interruptions for workloads, and this aim can be supported with the Kubernetes PodDisruptionBudget resource, which limits how many replicas may be taken down at once by voluntary disruptions such as node drains, so that a specified number of replicas stays available. This helps Kubernetes clusters achieve fault tolerance. So the migration to Kubernetes matters despite possible risks, because it can save significant time in deployment processes.
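The effect of a PodDisruptionBudget can be illustrated with simple arithmetic. The sketch below assumes a budget expressed as an absolute `minAvailable` number (the resource also accepts percentages and a `maxUnavailable` form, which are not modeled here); the function names are hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of Pods that may be evicted voluntarily without breaking the budget."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """Whether one more voluntary eviction (e.g. during a node drain) is permitted."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Five healthy replicas with minAvailable=4: one Pod may be evicted.
print(allowed_disruptions(5, 4), can_evict(5, 4))
# After that eviction only four remain healthy, so further evictions must wait.
print(allowed_disruptions(4, 4), can_evict(4, 4))
```

Note that such a budget only governs voluntary disruptions; it cannot prevent Pods from disappearing when a node fails unexpectedly.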


Kubernetes is a modern and trending tool for running projects of any size. A team of proficient DevOps and software engineers with K8s experience can deliver the desired solution while meeting high productivity, reliability, and security demands. It can accelerate the product’s release cycle and cut infrastructure costs, because Kubernetes has numerous embedded components and solutions that can be managed at a high abstraction level. Automatic horizontal scaling, load-balancing, replication, service networking, distributed storage volumes, and monitoring are only some of the significant features of K8s.

Data Engineering in Kubernetes is a very relevant topic today because K8s provides the ability to launch distributed computation tasks and other high-load workloads. Smooth replication across nodes and GPU support can significantly increase the speed of Machine Learning experiments or Data Engineering pipelines.

The Kubernetes ecosystem has efficient software for AI, ML, and Big Data projects, both for development and production usage. For example, data scientists working in Kubernetes can create an ML model in a JupyterHub notebook, convert it into an ML pipeline for Kubeflow with the Kale plugin, and launch a Katib hyperparameter tuning job to find the best combination. After that, the trained model can be served in production for business apps with a model serving component such as KServe (formerly KFServing). And there are plenty of other solutions for Data Engineering (e.g. Spark Operator) which are effective in Kubernetes.