High Performance ELK with Kubernetes: Part 1

Set up Elasticsearch and Kibana in a Kubernetes cluster

Jesse Swidler
10 min read · Aug 7, 2018


Deploying a large Elasticsearch cluster used to be hard, but with the power of Kubernetes and a few simple scripts, anyone can create whatever sized cluster they need. This can be a fun little exercise with Elasticsearch and Kubernetes or a real solution for managing a scalable Elasticsearch cluster at the lowest price possible.

In Part 1, you will see how easy it is to start Elasticsearch and Kibana in a Kubernetes cluster. We will also go over the steps to initialize your first Elasticsearch index templates and set up two daemons to collect logs and metrics from Kubernetes.

In Part 2, we will cover one possible strategy for exposing Elasticsearch and Kibana to the internet.

Getting Started

Kubernetes

To get the most out of this guide, you are going to need access to a Kubernetes cluster. I recommend Google Kubernetes Engine (GKE), but the majority of this guide can be used on any Kubernetes cluster. In a pinch, you can use minikube as a zero-cost way to follow along, but this guide is not meant for setting up Elasticsearch on a single computer; for that, you would be much better off running a single instance of Elasticsearch instead of five or more.

If you are using GKE, choose a node size with 4, 8, or 16 CPUs, and a cluster size of at least 2. If you aren’t sure how much hardware you need, you can start small to reduce setup costs and scale up later.

If you are going to use minikube to follow along, the default options will not provide enough memory. Make sure you have enough memory and disk space by starting the cluster with minikube start --memory 8192 --disk-size 50g --cpus 4.

Check that kubectl is configured to talk with your Kubernetes cluster:

$ kubectl get nodes
NAME                     STATUS    ROLES     AGE       VERSION
gke-my-kubernetes-c...   Ready     <none>    3m        v1.10.5-gke.3
gke-my-kubernetes-c...   Ready     <none>    3m        v1.10.5-gke.3

Did you get at least one ready node back? Great! You will be able to issue commands to your Kubernetes cluster as you go through the guide.

Elasticsearch

Let’s also cover some important Elasticsearch information. Elastic describes Elasticsearch as “a distributed, RESTful search and analytics engine” which maybe sounds a bit better than “database”, so we’ll go with their version. Elasticsearch stores information, but it also makes heavy use of indexing so searching through that data is fast. Distributed is one of the key words here, not only because it adds fault tolerance and redundancy, but because the indexing and searching power at your disposal is related to how many CPUs are in your cluster.

Put simply, if you need to index a lot of information you are going to need a lot of cores. For example, in one hour, I was able to index over 50 GB of complex data in 40 million documents using a 7 node cluster with 16 CPUs each. That’s 112 CPUs total!

For this kind of high performance setup, it’s useful to set up different types of Elasticsearch nodes. In our cluster, we are going to use two different types of nodes:

  • Master nodes: The elected master node performs cluster duties such as deciding where to place index shards and when to move them. We will use three master nodes, so we need a quorum of two (N/2 + 1) for the cluster to operate; one master can go offline at a time without interrupting service. If we used only one or two master nodes, all of them would need to be present for the cluster to function. For more information, see here. (A configuration sketch appears below.)
  • Data nodes: The data nodes make up the heart of the cluster. For us, they will store and ingest the data. We will also direct Kibana to use a data node for the REST API, so they will also coordinate and reduce search results. We will create at least two data nodes, which allows the data to be replicated and provides a backup should one of the nodes go offline.

It is possible to create even more kinds of specialized nodes; for more information, see here.
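
In Elasticsearch 6.x, node roles can be set through environment variables on the official docker image, which is how the manifests in this guide configure them. Below is a hedged sketch for a master-only node; the quorum setting name is standard for 6.x, but the exact values and whether the repo sets the quorum this way are assumptions.

env:
  - name: node.master
    value: "true"                              # eligible for master election
  - name: node.data
    value: "false"                             # master-only: stores no shards
  - name: node.ingest
    value: "false"
  - name: discovery.zen.minimum_master_nodes
    value: "2"                                 # quorum for 3 masters (N/2 + 1)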

Configure the provided templates

To get started, clone my elasticsearch-kubed git repo and cd into the new directory:

git clone https://github.com/jswidler/elasticsearch-kubed.git
cd elasticsearch-kubed/

This repo will be used to generate YAML configuration files for your Kubernetes cluster. Run config-templates.py and answer the prompts to fill in the templates.

The output of the script will be in a subdirectory of the clusters directory named after the namespace you chose. You should cd into that directory before running the rest of the commands in this guide.

$ cd cluster/default

The Logstash and oauth2_proxy parts are used to expose Elasticsearch on the internet. We’ll skip those for now and come back to them later once the Elasticsearch cluster is set up.

Starting Elasticsearch

As you will see in just a minute, it only takes a single command to give Kubernetes all the information it needs to spin up an Elasticsearch cluster. Before we run that command, though, there are two resources we may need to define first.

The first is the Kubernetes Namespace for your Elasticsearch-related resources. The second is the Kubernetes StorageClass that will back the persistent volumes used by the Elasticsearch data nodes.

Namespace

By separating different Elasticsearch clusters into different namespaces, you can easily deploy as many Elasticsearch clusters as you need to your Kubernetes cluster. All you need to do is use a different namespace, and the new definitions will not overlap with the existing ones.

If you choose to use the default namespace, you can skip this step. Otherwise, apply the namespace file in the 1_k8s_global directory.

$ kubectl apply -f 1_k8s_global/namespace.yml
namespace/my-es-cluster created
$ kubectl get namespaces
NAME            STATUS    AGE
default         Active    41m
kube-public     Active    41m
kube-system     Active    41m
my-es-cluster   Active    6s
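
For reference, the namespace file itself is only a few lines; a minimal sketch, assuming the namespace name my-es-cluster chosen during configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: my-es-cluster   # whatever name you entered in config-templates.py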

StorageClass

The next step is to define a new StorageClass for the persistent volumes used by the data nodes.

If you are using a cloud-based cluster, the StorageClass for the persistent volumes will be ssd (in 2_elasticsearch/es-data.yml). Two options are provided to create a StorageClass with that name, one for GCE and one for AWS; pick whichever suits your platform. Alternatively, you can look for a more suitable StorageClass here.

If you chose the minikube node size during setup, the storageClassName will be standard, which comes preinstalled, so you can skip this step.

$ kubectl apply -f 1_k8s_global/gce_storage.yml
storageclass.storage.k8s.io/ssd created
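
A GCE StorageClass named ssd amounts to roughly the following sketch; the repo's file may set additional parameters, and the AWS variant swaps in the kubernetes.io/aws-ebs provisioner:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd   # AWS: kubernetes.io/aws-ebs with type: gp2
parameters:
  type: pd-ssd                      # SSD-backed persistent disks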

Start the Elasticsearch cluster

We can set the configuration for the Elasticsearch cluster in Kubernetes all at once by applying the directory 2_elasticsearch with kubectl.

$ kubectl apply -f 2_elasticsearch
statefulset.apps/elasticsearch-data created
poddisruptionbudget.policy/elasticsearch-data created
deployment.apps/elasticsearch-master created
poddisruptionbudget.policy/elasticsearch-master created
service/elasticsearch-master created
service/elasticsearch created

This is going to start the Elasticsearch cluster and might take a little while, so we’ll return to check on it in a bit.

Workloads and Services for Elasticsearch

Workloads

Notice in the output that we just created a Deployment for elasticsearch-master and a StatefulSet for elasticsearch-data. Deployments and StatefulSets are controllers for two different Kubernetes Workloads that are very similar to each other. Each Deployment or StatefulSet manages a set of similar Pods.

In both cases you will define your desired state, and the Deployment or StatefulSet controller will attempt to reach your desired state. The primary difference is that with StatefulSets, Pod identities will be reused and the Pods will be updated in a particular order.

For data nodes, we want the volume with the Elasticsearch data to be reattached to a new Pod when the old one is shut down (i.e., when updating the StatefulSet configuration). If we did not do this, we would lose all the data in the database after any kind of maintenance. The StatefulSet controller will give the replacement Pod the same name and access to the resources of the Pod it replaced.
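
This reattachment behavior comes from the StatefulSet's volumeClaimTemplates, which create one PersistentVolumeClaim per Pod and bind it to the Pod's identity. Here is a hedged sketch of how the data volume might be declared; the claim name and size are assumptions:

volumeClaimTemplates:
  - metadata:
      name: es-data                    # mounted at the Elasticsearch data path
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ssd            # the StorageClass created earlier
      resources:
        requests:
          storage: 100Gi               # pick a size that fits your data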

Services

Kubernetes Services are used for service discovery and load balancing purposes.

Two services have been created for Elasticsearch nodes to find each other, one for master nodes and one for data nodes. Creating these services allows the use of DNS within Kubernetes to find specific Pods. Requests sent to elasticsearch-master.namespace.svc.cluster.local or elasticsearch.namespace.svc.cluster.local should be routed to one of the Pods we care about, which is determined by the service’s selector.

Both of these services are created as headless services. This has the practical effect of exposing each Pod that matches the selector as a separate A record in the DNS table. For instance, from within my cluster, I can use dig to find the addresses of all the active data nodes.

# dig elasticsearch.default.svc.cluster.local

;; QUESTION SECTION:
;elasticsearch.default.svc.cluster.local. IN A

;; ANSWER SECTION:
elasticsearch.default.svc.cluster.local. 30 IN A 172.17.0.11
elasticsearch.default.svc.cluster.local. 30 IN A 172.17.0.4
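
A headless Service is simply one whose clusterIP is set to None. A minimal sketch of the data-node service follows; the selector labels are assumptions and the repo's labels may differ:

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
spec:
  clusterIP: None              # headless: one DNS A record per matching Pod
  selector:
    app: elasticsearch-data    # assumed label; must match the data node Pods
  ports:
    - name: http
      port: 9200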

The configurations for the two workloads are similar, but let's take some time to appreciate the differences between the two node types.

  • All of the Pods are created from the same docker image, with Elasticsearch configured by environment variables. Four environment variables define the capabilities of each node: node.master, node.ingest, node.data, and search.remote.connect. Learn more about what each of these options means here.
  • There are two kinds of health checks, but the master nodes only use one of them. There is a livenessProbe, which checks that Elasticsearch is listening on port 9300 to confirm the application is running. There is also a readinessProbe, which performs an Elasticsearch cluster health check and expects a 200 response. The master nodes only use the livenessProbe, so they will be added to the DNS entry for elasticsearch-master as soon as Elasticsearch is running, regardless of whether the Elasticsearch cluster is actually available. The data nodes also have a readinessProbe, which means clients connecting to elasticsearch.default.svc.cluster.local will not see those nodes until after they are able to join the cluster through their discovery mechanism. (A probe sketch follows this list.)
  • Of course, a big difference between a data node and a master node is the attached data volume. Kubernetes will create volumes depending on your StorageClass, which for me means that volumes are created as Google Compute Engine Disks.
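
Here is a hedged sketch of the two probe types described above; the repo's actual ports, paths, and thresholds may differ:

livenessProbe:              # is the Elasticsearch process up at all?
  tcpSocket:
    port: 9300              # transport port
  initialDelaySeconds: 30
readinessProbe:             # has this node joined a healthy cluster?
  httpGet:
    path: /_cluster/health  # the real check may add query parameters
    port: 9200              # REST port; expects a 200 response
  initialDelaySeconds: 30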

Start Kibana

Kibana’s official docker image can also be configured with simple environment variables, so all we need to do is create a Service and a Deployment for it.

$ kubectl apply -f 3_kibana
service/kibana created
deployment.apps/kibana created

The Service we created this time will get a cluster IP, and Kubernetes will load balance requests among the available Pods instead of creating a unique DNS A record for each. One benefit is that the IP address is stable: even if the Pod is replaced, the IP address will not change.
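
For reference, the container portion of the Kibana Deployment is roughly the following sketch; the image tag is an assumption, and in 6.x the official image reads the Elasticsearch address from ELASTICSEARCH_URL:

containers:
  - name: kibana
    image: docker.elastic.co/kibana/kibana-oss:6.3.2   # assumed tag
    env:
      - name: ELASTICSEARCH_URL
        value: http://elasticsearch:9200               # the headless data-node service
    ports:
      - containerPort: 5601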

Check on the cluster

Check whether all your Pods are ready.

$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
elasticsearch-data-0              1/1       Running   0          33m
elasticsearch-data-1              1/1       Running   0          32m
elasticsearch-master-66c5597...   1/1       Running   0          33m
elasticsearch-master-66c5597...   1/1       Running   0          33m
elasticsearch-master-66c5597...   1/1       Running   0          33m
kibana-5c9767dc4-dqmwt            1/1       Running   0          1m

Hint: If you chose a namespace, use -n NAMESPACE or change your default namespace to see your Pods.

At this point, we should have successfully started a new Elasticsearch cluster and pointed Kibana at it. To test it, we can use port forwarding to access the cluster from our own computer.

$ kubectl port-forward service/elasticsearch 9200
Forwarding from 127.0.0.1:9200 -> 9200
Forwarding from [::1]:9200 -> 9200

And in a separate shell:

$ curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "my-es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Set up Index Templates and Kibana dashboards

Now that the Elasticsearch cluster is running, it is a good idea to configure the index templates you intend to use. I am going to use Filebeat to collect logs and Metricbeat to collect metrics from Kubernetes. The 4_beats_init folder contains configs for four Jobs, which either set up an index template or install Kibana dashboards. These Jobs run once and are only retried if they do not exit cleanly. Run the ones you want individually, or run them all by applying the folder.

$ kubectl apply -f 4_beats_init
job.batch/filebeat-dashboard-init created
job.batch/filebeat-template-init created
job.batch/metricbeat-dashboard-init created
job.batch/metricbeat-template-init created
$ kubectl get jobs
NAME                        DESIRED   SUCCESSFUL   AGE
filebeat-dashboard-init     1         1            1m
filebeat-template-init      1         1            1m
metricbeat-dashboard-init   1         1            1m
metricbeat-template-init    1         1            1m

Once the jobs are successful, your Elasticsearch cluster is ready to use.
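
For reference, each of these Jobs is essentially a one-shot Pod that runs the corresponding Beats setup command against the cluster. Here is a hedged sketch of the template-init Job for Filebeat; the image tag, flags, and host are assumptions, and the repo's manifests may differ:

apiVersion: batch/v1
kind: Job
metadata:
  name: filebeat-template-init
spec:
  template:
    spec:
      restartPolicy: OnFailure        # retry only if the setup command fails
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:6.3.2   # assumed tag
          args:
            - setup
            - --template
            - -E
            - output.logstash.enabled=false
            - -E
            - 'output.elasticsearch.hosts=["elasticsearch:9200"]'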

Start Filebeat and Metricbeat Daemons

This step is optional, but if you would like to add extra monitoring and log collection to your Kubernetes cluster, Filebeat and Metricbeat make that possible. Kubernetes DaemonSets are another kind of workload; they ensure that every Kubernetes node runs a copy of a Pod, which is exactly what we need to deploy these agents (a rough sketch of the Filebeat DaemonSet appears at the end of this section). Apply the next directory to launch them*.

$ kubectl apply -f 5_beats_agents
configmap/filebeat-config created
configmap/filebeat-inputs created
daemonset.extensions/filebeat created
clusterrolebinding.rbac.authorization.k8s.io/filebeat created
clusterrole.rbac.authorization.k8s.io/filebeat created
serviceaccount/filebeat created
configmap/metricbeat-config created
configmap/metricbeat-daemonset-modules created
daemonset.extensions/metricbeat created
configmap/metricbeat-deployment-modules created
deployment.apps/metricbeat created
clusterrolebinding.rbac.authorization.k8s.io/metricbeat created
clusterrole.rbac.authorization.k8s.io/metricbeat created
serviceaccount/metricbeat created

*Note: If you are running in GKE, you will need to use the following command to give yourself the cluster-admin role before you can fully deploy the daemons.

kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole=cluster-admin \
--user $(gcloud config get-value account)

Second Note: In order for Metricbeat to collect all metrics, you also need to install kube-state-metrics on your cluster.

git clone https://github.com/kubernetes/kube-state-metrics.git
kubectl apply -f kube-state-metrics/kubernetes
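
Before moving on, here is the rough shape of the Filebeat DaemonSet mentioned above. This is a hedged sketch: the apiVersion, image tag, labels, and mounts are assumptions, and the repo's manifest also wires in the ConfigMaps created above.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      serviceAccountName: filebeat    # bound to the ClusterRole created above
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:6.3.2   # assumed tag
          volumeMounts:
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true          # read container logs from the host
      volumes:
        - name: containers
          hostPath:
            path: /var/lib/docker/containers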

Connect to Kibana

For now, we can connect to Kibana the same way we did with Elasticsearch: through Kubernetes port forwarding. To open the tunnel, run kubectl port-forward service/kibana 5601, then open http://localhost:5601 in your browser to get to Kibana. Set up a default index pattern to begin exploring the data from the Beats agents.

If Kibana loads and shows data from the Beats agents, things are going well.

Next…

Congratulations on making it this far! Next up, we will learn how to expose the cluster on the internet, using Logstash to authenticate data sources and oauth2_proxy to secure Kibana. Continue on to Part 2 of this guide by following the link.

High Performance ELK with Kubernetes: Part 2

Ready to Learn More?

Check out Udacity’s full catalog

Follow Us

For more from the engineers and data scientists building Udacity, follow us here on Medium.

Interested in joining us @udacity? See our current opportunities.
