Running data workloads on Kubernetes

Not long ago, Kubernetes was primarily aimed at stateless workloads. With the steady increase in projects that focus on big data processing, its role has changed accordingly. More and more projects use Kubernetes to operate stateful applications that demand persistent storage. Databases, data lakes, and applications that rely on fast filesystems as the backbone of their data processing are increasingly at home on Kubernetes. In this article, we’ll explore what’s important when running data workloads on Kubernetes.

Context & starting point

Before diving into the details, it’s good to know that running stateful applications on Kubernetes is indeed gaining traction. Last year, Google published an article detailing the growth over the last couple of years, along with trends in key aspects such as cost, scalability, and resilience. It suggests you’re on the right path in exploring this trend.

So where would you start when exploring solutions to run data-intensive projects on Kubernetes? The CNCF offers a publicly available Storage Whitepaper (currently version 2) that explores various aspects of the current storage landscape, including storage from the perspective of Kubernetes and other cloud-native deployment methods.

Alongside the more generic Storage Whitepaper, the CNCF has also published the “Data on Kubernetes – databases Whitepaper”. This whitepaper focuses specifically on running databases on Kubernetes and acts as a set of guiding principles that brings the various aspects of the domain together. Other data solutions are out of its scope.

Running databases

If you choose to run a database such as MySQL, MongoDB, or Cassandra on your Kubernetes infrastructure, the whitepaper covers the following aspects:

The interplay of custom resources, operators & the management of DB clusters.
Managing the Kubernetes-based resources as well as the interface to the physical databases.

All in all, the whitepaper concentrates on explaining the attributes of storage systems and how they affect the database applications you intend to run. It also explains the benefits of running databases on Kubernetes and covers common patterns for running database and non-database applications in Kubernetes.

Special attention is given to the operational aspects of running databases in Kubernetes: upgrades, backup and restore, handling storage capacity, and data migration.

Data migration is particularly interesting because it is closely related to database schema changes, which in turn affect lifecycle management.

A practical example: PostgreSQL on Kubernetes

If you want to try out what it’s like to run PostgreSQL on Kubernetes, the CloudNativePG website offers a lot of information. Essentially, CloudNativePG is a Kubernetes operator that manages the complete architecture of a highly available database cluster using native streaming replication. No external failover management tool is needed, since the operator interacts directly with the Kubernetes API server to update the state of the cluster.

From an operational point of view, most tasks can be done declaratively, such as managing users, application databases, and credentials. A built-in exporter that can be configured to your needs exposes metrics to Prometheus, while PostgreSQL logs, including audit logs, are written to standard output in JSON format.
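To make the declarative approach more tangible, here is a minimal sketch of a CloudNativePG Cluster manifest, written to a file from a shell script. The cluster name, the three instances, and the 1Gi storage size are illustrative assumptions, not recommendations; the commented kubectl commands assume the operator is already installed in a test cluster.

```shell
#!/bin/sh
# Write a minimal CloudNativePG Cluster manifest to a file.
# Name, instance count, and storage size are illustrative assumptions.
manifest="${TMPDIR:-/tmp}/cluster-example.yaml"

cat > "$manifest" <<'EOF'
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  storage:
    size: 1Gi
EOF

echo "Wrote $manifest"

# With the operator installed and kubectl pointed at a test cluster:
#   kubectl apply -f "$manifest"
#   kubectl get clusters.postgresql.cnpg.io cluster-example
```

The operator reads this desired state and takes care of creating the primary and its replicas; changing the number of instances later is a matter of editing the manifest and applying it again.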

Data management

Data management on Kubernetes is vital to get the most out of the data your applications process. An open-source framework such as Kanister, which operates at the application level, helps here.

In short, Kanister enables (data) domain experts to create specific data management blueprints. These can be extended and shared inside the organization. Working with Kanister gives you a uniform operational experience across the large set of applications you run in your organization.

Its core benefits are as follows:

Sidecar containers keep application changes to a minimum, which reduces the risk of changing production-grade workloads.
An established, sizable community and its existing blueprints give you a quick start, so there is no need to start from scratch.
All of the above can be done by domain experts, who know their own applications best.

PostgreSQL example blueprint

Like the earlier example, the following blueprint focuses on PostgreSQL, this time running on Amazon RDS. The snippet below creates a snapshot of the given database instance, waits for it to complete, and records the instance’s security group ID and subnet group as outputs:

aws rds create-db-snapshot --db-instance-identifier="{{ index .Object.data "postgres.instanceid" }}" --db-snapshot-identifier="{{ .Object.metadata.namespace }}-{{ toDate "2006-01-02T15:04:05.999999999Z07:00" .Time | date "2006-01-02T15-04-05" }}" --region "{{ .Profile.Location.Region }}"

aws rds wait db-snapshot-completed --region "{{ .Profile.Location.Region }}" --db-snapshot-identifier="{{ .Object.metadata.namespace }}-{{ toDate "2006-01-02T15:04:05.999999999Z07:00" .Time | date "2006-01-02T15-04-05" }}"

vpcsgid=$(aws rds describe-db-instances --db-instance-identifier="{{ index .Object.data "postgres.instanceid" }}" --region "{{ .Profile.Location.Region }}" --query 'DBInstances[].VpcSecurityGroups[].VpcSecurityGroupId' --output text)

kando output securityGroupID $vpcsgid

dbSubnetGroup=$(aws rds describe-db-instances --db-instance-identifier="{{ index .Object.data "postgres.instanceid" }}" --region "{{ .Profile.Location.Region }}" --query 'DBInstances[0].DBSubnetGroup.DBSubnetGroupName' --output text)

kando output dbSubnetGroup $dbSubnetGroup
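The snapshot identifier in the snippet above is built from the namespace and a timestamp whose colons are replaced by hyphens, presumably because RDS snapshot identifiers only allow letters, digits, and hyphens. The following sketch mimics that naming in plain shell so you can see what the rendered value looks like. The function name snapshot_id and the sample values are hypothetical, and unlike the Go template it performs no time zone conversion.

```shell
#!/bin/sh
# Hypothetical helper: build "<namespace>-<timestamp>" in the same shape
# as the template expression used in the blueprint snippet.
snapshot_id() {
  namespace="$1"   # namespace of the object the action runs against
  timestamp="$2"   # RFC 3339 timestamp, e.g. 2024-05-01T10:20:30.123456789Z

  # Drop fractional seconds and the zone suffix, then swap ':' for '-'.
  # No time zone conversion is done here (an assumption of this sketch).
  cleaned=$(printf '%s\n' "$timestamp" |
    sed -E 's/(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})$//; s/:/-/g')

  printf '%s-%s\n' "$namespace" "$cleaned"
}

snapshot_id "pg-test" "2024-05-01T10:20:30.123456789Z"
# prints: pg-test-2024-05-01T10-20-30
```

Because the same expression is used in both the create and the wait command, both resolve to the same identifier as long as they are rendered with the same timestamp.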

Another blueprint offers the reverse operation: restoring a specific snapshot.

As you can see, the blueprints work independently of the actual application in scope. You only need to supply a few simple parameters and Kanister does the rest.
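As a rough illustration of what supplying those parameters can look like, the sketch below assembles a kanctl command that would request the backup action. It only prints the command instead of running it. The blueprint name, profile, namespaces, and ConfigMap name are placeholders, and the use of a ConfigMap is an assumption based on the template reading postgres.instanceid from .Object.data.

```shell
#!/bin/sh
# Dry run: assemble a kanctl command that would trigger the backup action.
# Every name below is a placeholder; adjust it to your own environment.
blueprint="rds-postgres-snapshot-bp"       # assumed blueprint name
profile="kanister/s3-profile-example"      # assumed <namespace>/<profile name>
object="v1/configmaps/pg-test/dbconfig"    # assumed ConfigMap with postgres.instanceid

cmd="kanctl create actionset --action backup --namespace kanister --blueprint $blueprint --profile $profile --objects $object"

# Print instead of executing, so the sketch is safe to run anywhere.
echo "$cmd"
```

Running the printed command against a cluster with Kanister installed would create an ActionSet, which the Kanister controller then executes using the phases defined in the blueprint.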

Community

There is an active community called “Data on Kubernetes (DOK)” that offers plenty of resources to find solutions to open problems, explore use cases, and broaden your knowledge in general. Several talks at various conferences also cover this topic in greater detail.

Conclusion

Running stateful workloads on Kubernetes has become more popular over the last couple of years. We’ve explored several sources to get you started with running databases on Kubernetes: a number of whitepapers, as well as an open-source solution that uses blueprints for data management tasks on Kubernetes. Finally, two PostgreSQL examples made things concrete.
