Solving the Stateful Service Problem in Container Orchestration


As we venture further into the beautiful world of containers, where everything is stateless, autoscales as required and shares the underlying compute resources, we have yet to solve the problem of deploying stateful applications on our container orchestration stacks.

This blog article gives you an introduction to the problem set and describes a few solutions for systems such as Kubernetes, Fleet and Swarm.

If you are new to container orchestration, the links below will give you a good introduction. The text that follows assumes a basic understanding of containers and container orchestration.

Problem definition

Containers in combination with a cluster manager such as Kubernetes, Swarm or Fleet are a wonderful stack that solves a lot of deployment & development problems, and they pair naturally with the 12 factor pattern for building services. This is all oriented towards solving the cattle problem rather than the pet problem; see pets vs cattle for an explanation of the terminology.

For the backing services in the 12 factor pattern, such as MySQL, ZooKeeper or Kafka, there is no clear solution if you want to host them yourself rather than use a hosted service provided by GCE, AWS or a 3rd party. At Traintracks we deploy inside the customer's system and thus need to adapt to whatever they have available, and our internally developed database requires a low latency datastore such as directly attached SSDs or proprietary solutions such as EBS or GCE PD.

To take Elasticsearch as an example: unless we can separate the lifecycle of the volume from the lifecycle of a container, it's impossible to perform many traditional operations, such as configuration changes or software upgrades, without a rolling upgrade where we wait for each new container to resync completely after restart. If the lifecycles are not separated, the container's termination means the termination of all associated resources, such as its data.

Below is a list of resources that we want an exclusive lock on and that must be kept together as one unit of resources tied to the instance. They all need to be combined into one unit rather than tied to the instance separately, as many systems use the unique ID and the data in combination to function properly.

  • Data
  • Single writer
  • Hostname
  • IP-address
  • Unique IDs (zookeeper myid for example)
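ZooKeeper illustrates why the unique ID and the data must travel as one unit. A minimal sketch of a node's on-host state (paths and values are purely illustrative):

```shell
# Hypothetical layout of a ZooKeeper node's state on a host volume.
# Restoring this data directory under a different myid would corrupt
# the ensemble's view of the world, so the two must move together.
datadir=$(mktemp -d)
echo "1" > "$datadir/myid"                        # unique ID of this instance
mkdir -p "$datadir/version-2"                     # transaction logs & snapshots
echo "snapshot" > "$datadir/version-2/snapshot.0" # data tied to that ID
ls "$datadir"
```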

Solution with storage as a service

Flocker proposes a solution to this issue by letting you treat volumes as a service, so a volume can be moved from one container to another. This requires an underlying relocatable storage backend such as iSCSI, EBS or GCE PD; any non-relocatable storage is a no go.

Kubernetes can integrate with Flocker or use its own persistent volume mechanism, where a pod lays claim to a specific volume and thus prevents any overlapping use of the resource; this is fully supported by its various controllers (constraint solvers).
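As a sketch of what such a claim looks like in the Kubernetes PersistentVolumeClaim API (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-claim
spec:
  accessModes:
    - ReadWriteOnce        # a single node may mount the volume read-write
  resources:
    requests:
      storage: 100Gi
```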

They have even gone a step further and implemented a controller called PetSet, which is suitable for deploying stateful applications as it guarantees unique hostnames and locks each pod to a specific volume. This is currently in alpha and not yet suitable for production deployment. The roadmap does include things such as local storage, upgrading of images etc.

As PetSets are not yet production ready, in Kubernetes land you can instead implement stateful services with raw pods, replicas=1 ReplicationControllers or DaemonSets.

  • k8s setup with rc, replica=1 for elasticsearch
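A sketch of the replicas=1 ReplicationController approach for an Elasticsearch data node; the image, labels and claim name are illustrative, and the volume could just as well be a Flocker volume:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: elasticsearch-data
spec:
  replicas: 1                      # exactly one pod owns the volume
  selector:
    app: elasticsearch-data
  template:
    metadata:
      labels:
        app: elasticsearch-data
    spec:
      containers:
        - name: elasticsearch
          image: elasticsearch:2.3
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: elasticsearch-data-claim
```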


Using the disk as a service pattern has a few tradeoffs (no silver bullets or free lunches here). Most noteworthy is the loss of performance: remote storage will always be slower than locally attached disks, and that will ring true unless our understanding of physics is severely flawed (speed of light and all that jazz).

Configuration is unique for each provider and is tied to your choice of manager (k8s or Flocker). Each provider library will have its own unique set of bugs and the usual weirdness that you expect from new software.

A curated short list of issues you'll face:

  • Performance is lacking for certain providers (iSCSI, Ceph), especially when it comes to latency. The best performance will always come from storage locally attached to the computation hardware.
  • Baremetal servers with attached disks are not supported
  • Obscure providers are not supported at all
  • Configuration will be vendor specific and thus increases the lock-in factor

Solutions with an exclusive instance per node

By running just a single instance of the stateful service per compute node, we can safely export host resources to it and guarantee that no split brain problems can arise.


The unique property of a DaemonSet in Kubernetes is that it runs at most one instance per Kubernetes node: in essence it runs as many instances as there are matching nodes, but never more than one per node. This gives you a safety guarantee for host-specific resources such as hostPath, hostPort and so on.

Thus we get a guarantee that no two pods will ever race for the same resources on a machine, which allows us to use static host mounts safely for stateful applications without worrying about multiple containers using them.

Rolling updates of DaemonSets

By deleting the DS with --cascade=false you can set up the same DS with a changed configuration (as long as the selector still matches the old pods). --cascade=false performs a non cascading delete of a resource in k8s, which has the effect of not touching any pods created by the old DaemonSet, so the new one can be added without recreating the pods.

You can manually flush the old pods as you go along. This requires some kind of locking mechanism inside the container as new pods will be spun up when another pod
is terminating.
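That flow can be sketched as follows. This is a dry run: KUBECTL is set to `echo kubectl` so the commands are printed rather than executed (drop the echo to run them against a cluster), and the pod name is illustrative:

```shell
# Dry-run sketch of a non-cascading DaemonSet update.
KUBECTL="echo kubectl"
$KUBECTL delete ds elasticsearch-data --cascade=false  # orphan the running pods
$KUBECTL create -f ds.yaml                             # new DS adopts pods matching its selector
# Then flush the old pods one at a time, waiting for each replacement:
$KUBECTL delete pod elasticsearch-data-abc12           # illustrative pod name
```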

The other option is to create a new DS with a mutually exclusive node label selector that ensures a conflict, such as elasticsearch-version=v1 versus elasticsearch-version=v2.

This allows you to go through a chain of:
1. Create a DS with the updated settings and a node selector of elasticsearch-version=v2
2. Remove the elasticsearch-version label from a node
3. Wait for the pod on that node to terminate
4. Label the node with elasticsearch-version=v2
5. Wait for the pod to start up and become ready
6. Repeat steps 2 - 5 on the next node until there are no more nodes.

This allows a progressive rollout of a new image / configuration with a very small manager controller.
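The loop such a manager controller performs can be sketched like this. Node names are illustrative, KUBECTL is set to `echo kubectl` so the commands are printed rather than executed, and the wait steps are left as comments:

```shell
KUBECTL="echo kubectl"   # drop the echo to run for real
for node in node1 node2 node3; do
  $KUBECTL label node "$node" elasticsearch-version-    # step 2: remove the old label
  # step 3: poll `kubectl get pods` until the v1 pod on $node has terminated
  $KUBECTL label node "$node" elasticsearch-version=v2  # step 4: schedule the v2 pod
  # step 5: poll until the v2 pod on $node is Running and Ready
done
```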

kind: "DaemonSet"
apiVersion: "extensions/v1beta1"
metadata:
  name: elasticsearch-data-v1
spec:
  template:
    metadata:
      name: "elasticsearch-data"
      labels:
        name: "elasticsearch-data"
        app: "elasticsearch-data"
        elasticsearch-data-version: "v1"
    spec:
      terminationGracePeriodSeconds: 240
      nodeSelector:
        elasticsearch-data-node: "true"
        elasticsearch-data-version: "v1"
      volumes:
        - name: elasticsearch
          hostPath:
            path: /data/elasticsearch
      containers:
        - name: "server"
          image: "localhost:5000/elasticsearch:v1"
          env:
            - name: "ES_MASTER_HOSTS"
              value: "elasticsearch-master-1,elasticsearch-master-2,elasticsearch-master-3"
            - name: "ES_NODE_TYPE"
              value: "data"
            - name: "ES_CLUSTER_NAME"
              value: "test"
            - name: "ES_HEAP_SIZE"
              value: "512m"
            - name: "ES_INDEX_NUMBER_OF_SHARDS"
              value: "1"
            - name: "ES_INDEX_NUMBER_OF_REPLICAS"
              value: "1"
          volumeMounts:
            - mountPath: /usr/share/elasticsearch/data
              name: elasticsearch
              readOnly: false
          ports:
            - containerPort: 9200
            - containerPort: 9300

Example deployment flow

# Assuming a three node kubernetes cluster where all nodes should run es-data
# named:
# - node1
# - node2
# - node3
kubectl create -f ds.yaml
# label nodes
for node in node1 node2 node3; do
  kubectl label node $node elasticsearch-data-version=v1
  kubectl label node $node elasticsearch-data-node=true
done
kubectl get pods
# Wait until pods are alive
# Create ds-v2.yaml and bump version from v1 to v2 on all variables
cp ds.yaml ds-v2.yaml
$EDITOR ds-v2.yaml
kubectl create -f ds-v2.yaml
# Remove the node label to evict the v1 pod from node1
kubectl label node node1 elasticsearch-data-node-
# wait for successful termination
kubectl get pods
# Bump the version label and re-add the node label so the v2 pod schedules
kubectl label --overwrite node node1 elasticsearch-data-version=v2
kubectl label node node1 elasticsearch-data-node=true
# Wait until pod is ready and accepting requests & es cluster is healthy
kubectl get pods
# repeat the steps until all nodes are running the v2 DS

CoreOS Fleet

The CoreOS Fleet orchestration tool can be coerced into the same solution as the k8s one by using conflicts, as seen in the example below.


# Ensure no more than one elasticsearch instance is scheduled per machine
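Only the comment above survives from the original unit file; a minimal sketch of a fleet template unit using the Conflicts directive might look like this (image and unit names are illustrative):

```ini
[Unit]
Description=Elasticsearch data node %i

[Service]
ExecStart=/usr/bin/docker run --rm --name elasticsearch-%i elasticsearch:2.3
ExecStop=/usr/bin/docker stop elasticsearch-%i

[X-Fleet]
# Ensure no more than one elasticsearch instance is scheduled per machine
Conflicts=elasticsearch@*.service
```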

Gist of a full ES deployment, credit goes to digital wonderland.


The future of this subject remains shrouded in uncertainty. PetSets in Kubernetes might magically solve all the problems, as local storage is on their roadmap, but development in Kubernetes moves at a glacial pace; frustrating, but good for stability, and indeed our Kubernetes nodes have been rock stable since day one.

Flocker is moving forward as well, and its integration with Kubernetes, Swarm & Fleet is getting better by the day. A generic solution may eventually arise from it, but it is not fully there yet.

Traintracks' custom built internal service, which labels nodes based on existing directories & mounts, and our rolling update system will be open sourced in the coming month. You can subscribe to our blog to be the first to know!
