Checkpoint in a Kubernetes Job

A Java application can be run on Kubernetes as a "canary" that creates a checkpoint for later runs. This document describes how to perform a checkpoint and restore end-to-end inside Kubernetes, using a minikube cluster, instead of checkpointing locally on a Linux machine or in a container outside Kubernetes.

Create a new namespace example

```
minikube start
eval $(minikube docker-env)
kubectl create ns example
kubectl config set-context --current --namespace=example
```

Build a container image locally, based on this example Dockerfile.
```
docker build -t azul-crac-example:k8s-spring-boot -f k8s/spring-boot/Dockerfile .
```
The first stage builds the application and the second stage adds the netcat utility and two scripts:
- checkpoint.sh starts the application with -XX:CRaCCheckpointTo=… and a netcat server listening on port 1111. When a client connects to this port, the script triggers the checkpoint via jcmd.
- restore-or-start.sh checks for checkpoint image files and either restores from that image or falls back to a regular application startup.
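
The decision that restore-or-start.sh makes can be sketched as follows. This is a minimal illustration, not the script shipped in the example image: the function name choose_command and its two parameters (checkpoint directory and application JAR) are hypothetical, and "image present" is assumed to mean "the directory exists and is not empty". -XX:CRaCRestoreFrom is the CRaC option that names the directory to restore from.

```shell
# Hypothetical sketch of the choice made by restore-or-start.sh.
# $1: directory that may hold a checkpoint image (illustrative parameter)
# $2: application JAR used for a regular start (illustrative parameter)
# Prints the java command line that should be run.
choose_command() {
    crac_dir="$1"
    app_jar="$2"
    # Assumption: a checkpoint exists when the directory is present and non-empty.
    if [ -d "$crac_dir" ] && [ -n "$(ls -A "$crac_dir" 2>/dev/null)" ]; then
        echo "java -XX:CRaCRestoreFrom=$crac_dir"
    else
        echo "java -jar $app_jar"
    fi
}
```

A wrapper script would typically exec the printed command so that the JVM replaces the shell process and receives signals from Kubernetes directly.
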

Create the resources in the Kubernetes cluster using the example k8s.yaml file.
```
kubectl apply -f k8s/spring-boot/k8s.yaml
```
The following resources are created:
- PersistentVolumeClaim that requests storage for the checkpoint image (in minikube it is bound automatically to a PersistentVolume)
- Deployment that runs the application using the restore-or-start.sh script
- Job that creates the checkpoint image

Check that the pods are running:

```
$ kubectl get po
NAME                                 READY   STATUS    RESTARTS   AGE
create-checkpoint-fsfs4              2/2     Running   0          4s
example-spring-boot-68b69cc8-bbxnx   1/1     Running   0          4s
```

Check if the application started normally

Explore the application logs (kubectl logs example-spring-boot-68b69cc8-bbxnx). Because the checkpoint image has not been created yet, the application goes through a regular startup. The other pod hosts two containers: one runs checkpoint.sh, and the other warms the application up using siege and then triggers the checkpoint by connecting to port 1111 (this is not a built-in feature; remember that netcat is listening in the background).
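
The flow inside the checkpoint container can be sketched roughly as below. This is an illustration under stated assumptions, not the checkpoint.sh from the example image: the function name and both parameters are hypothetical, the netcat flag syntax depends on the netcat variant installed, and the sketch assumes that the JVM process ends once the image has been written.

```shell
# Hypothetical sketch of the flow in checkpoint.sh.
# $1: directory for the checkpoint image, $2: application JAR (illustrative parameters).
run_checkpoint_flow() {
    crac_dir="$1"
    app_jar="$2"

    # Start the application in the background and tell the JVM where to write the image.
    java -XX:CRaCCheckpointTo="$crac_dir" -jar "$app_jar" &
    app_pid=$!

    # Block until a client connects to port 1111.
    # Assumption: traditional netcat syntax; OpenBSD netcat uses "nc -l 1111".
    nc -l -p 1111 >/dev/null

    # Ask the running JVM to checkpoint itself.
    jcmd "$app_pid" JDK.checkpoint

    # Assumption: the JVM process ends after the checkpoint; ignore its exit status.
    wait "$app_pid" || true
}
```

Containers in the same pod share a network namespace, so the warm-up container can reach this listener on localhost once it has finished generating load.
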

After a while the job completes:

```
$ kubectl get job
NAME                STATUS     COMPLETIONS   DURATION   AGE
create-checkpoint   Complete   1/1           19s        44m
```
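
Instead of polling kubectl get job by hand, a script can block until the Job finishes. The sketch below wraps kubectl wait; the function name and the 300-second default timeout are arbitrary choices for illustration, and the Job name create-checkpoint is the one used in this example.

```shell
# Block until the create-checkpoint Job reports the Complete condition.
# $1: optional timeout (default 300s, an arbitrary illustrative value).
wait_for_checkpoint_job() {
    kubectl wait --for=condition=complete --timeout="${1:-300s}" job/create-checkpoint
}
```

kubectl wait exits with a non-zero status if the timeout elapses first, so the function's exit status can gate the rollout step that follows.
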

Roll out a new deployment

This time the application should restore from the checkpoint image:
```
kubectl rollout restart deployment/example-spring-boot
```

After a short moment the application is back up:

```
NAME                                   READY   STATUS      RESTARTS   AGE
create-checkpoint-fsfs4                0/2     Completed   0          95s
example-spring-boot-79b98966db-ml2pj   1/1     Running     0          15s
```

In the logs you can check that the restore was performed:

```
INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Restarting Spring-managed lifecycle beans after JVM restore
INFO 129 --- [Attach Listener] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port 8080 (http) with context path ''
INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor  : Spring-managed lifecycle restart completed (restored JVM running for 45 ms)
```
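
The same check can be scripted by searching the logs for Spring's restore message. A minimal sketch, assuming that message appears on every restore; the function name was_restored is hypothetical.

```shell
# Succeeds if the log text on standard input contains Spring's restore message.
was_restored() {
    grep -q "Restarting Spring-managed lifecycle beans after JVM restore"
}

# Possible use against the running deployment (requires cluster access):
#   kubectl logs deployment/example-spring-boot | was_restored && echo "restored from checkpoint"
```
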

Verify if the application responds to requests

Expose the deployment and send a request. You should get the "Greetings from Spring Boot!" reply:

```
kubectl expose deployment example-spring-boot --type=NodePort --port=8080
URL=$(minikube service example-spring-boot -n example --url)
curl "$URL"
```
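
The service may need a moment before it answers, so a script might retry the request a few times. The sketch below is one way to do that; the function name, the default of 10 attempts, and the one-second pause are illustrative choices, and the expected reply is the greeting quoted above.

```shell
# Retry an HTTP GET until the expected greeting is returned.
# $1: URL to query, $2: optional number of attempts (default 10, illustrative).
wait_for_greeting() {
    url="$1"
    attempts="${2:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s hides progress output, -f makes curl fail on HTTP error statuses.
        if reply=$(curl -sf "$url") && [ "$reply" = "Greetings from Spring Boot!" ]; then
            echo "$reply"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

Called as wait_for_greeting "$URL", it prints the reply and returns 0 on success, or returns 1 once all attempts are used up.
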