Using CRaC on Kubernetes
A Java application can be run on Kubernetes as a "canary" to create a checkpoint for later runs. This document describes how to perform checkpoint and restore end-to-end inside Kubernetes using a Minikube cluster, rather than running locally or triggering the checkpoint in a standalone container. The steps use an example Spring Boot application that is available on GitHub; adjust them to match your use case.
-
Create a new namespace example:

    minikube start
    eval $(minikube docker-env)
    kubectl create ns example
    kubectl config set-context --current --namespace=example
-
Create a file Dockerfile.k8s with the following content:

    # syntax=docker/dockerfile:1.3-labs
    FROM azul/zulu-openjdk:23-jdk-crac-latest AS builder
    RUN apt-get update && apt-get install -y maven
    ADD . /example-spring-boot
    RUN cd /example-spring-boot \
        && mvn -B install \
        && mv target/example-spring-boot-0.0.1-SNAPSHOT.jar /example-spring-boot.jar

    FROM azul/zulu-openjdk:23-jdk-crac-latest
    RUN apt-get update && apt-get install -y ncat
    ENV CRAC_FILES_DIR=/cr
    COPY --from=builder /example-spring-boot.jar /example-spring-boot.jar

    # This script is going to be used in the checkpointing job
    COPY <<'EOF' /checkpoint.sh
    #!/bin/sh
    mkdir -p $CRAC_FILES_DIR
    rm $CRAC_FILES_DIR/* || true
    # After receiving connection on port 1111 trigger the checkpoint
    # (using numeric address to avoid IPv6 problems)
    (nc -v -l -p 1111 && jcmd example-spring-boot.jar JDK.checkpoint) &
    # This can't exec java because the pod would be marked as failed when it exits
    # with exit code 137 after checkpoint
    java -XX:CRaCCheckpointTo=$CRAC_FILES_DIR -XX:CRaCMinPid=128 -jar /example-spring-boot.jar &
    PID=$!
    trap "kill $PID" SIGINT SIGTERM
    wait $PID || true
    EOF

    COPY <<'EOF' /restore-or-start.sh
    #!/bin/sh
    if [ -z "$(ls -A $CRAC_FILES_DIR 2> /dev/null)" ]; then
      echo "No checkpoint found, starting the application normally..."
      exec java -jar /example-spring-boot.jar
    else
      echo "Checkpoint is present, restoring the application..."
      exec java -XX:CRaCRestoreFrom=$CRAC_FILES_DIR
    fi
    EOF

    ENTRYPOINT [ "bash" ]
    CMD [ "/restore-or-start.sh" ]
-
Build image example-spring-boot-k8s using Dockerfile.k8s. The first stage builds the application, and the second stage adds the netcat utility and two scripts:
  - checkpoint.sh starts the application with -XX:CRaCCheckpointTo=… and a netcat server listening on port 1111. When a client connects to this port, the checkpoint is triggered via jcmd.
  - restore-or-start.sh checks for the presence of checkpoint image files and either restores from this image or falls back to a regular application startup.

    docker build -f Dockerfile.k8s -t example-spring-boot-k8s .
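The decision that restore-or-start.sh makes can be sketched as a standalone shell function, which makes it possible to try the logic against any directory without starting a JVM. The function names has_checkpoint and startup_command are hypothetical and not part of the example image; the two java command lines are copied from the script.

```shell
# Hypothetical helper mirroring the test in restore-or-start.sh:
# succeeds only if the directory exists and contains at least one entry.
has_checkpoint() {
  dir="$1"
  [ -n "$(ls -A "$dir" 2> /dev/null)" ]
}

# Prints the java command line the script would exec, without running it.
startup_command() {
  dir="$1"
  if has_checkpoint "$dir"; then
    echo "java -XX:CRaCRestoreFrom=$dir"
  else
    echo "java -jar /example-spring-boot.jar"
  fi
}
```

For instance, `startup_command /var/crac/image` prints the restore command only once checkpoint files exist in that directory. Unlike the script in the image, this sketch quotes the directory argument, so an empty value is treated as "no checkpoint" rather than causing ls to list the current directory.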
-
Create a file k8s.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: crac-image
      namespace: example
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 500Mi
      storageClassName: "standard"
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: create-checkpoint
      namespace: example
    spec:
      template:
        spec:
          containers:
            - name: workload
              image: example-spring-boot-k8s
              imagePullPolicy: IfNotPresent
              env:
                - name: CRAC_FILES_DIR
                  value: /var/crac/image
              args:
                - /checkpoint.sh
              securityContext:
                capabilities:
                  add:
                    - CHECKPOINT_RESTORE
                    - SYS_PTRACE
              volumeMounts:
                - mountPath: /var/crac
                  name: crac-image
            - name: warmup
              image: jstarcher/siege
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - |
                  while ! nc -z localhost 8080; do sleep 0.1; done
                  siege -c 1 -r 100000 -b http://localhost:8080
                  echo "Do checkpoint, please" | nc -v localhost 1111
          restartPolicy: Never
          volumes:
            - name: crac-image
              persistentVolumeClaim:
                claimName: crac-image
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-spring-boot
      namespace: example
      labels:
        app: example-spring-boot
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-spring-boot
      template:
        metadata:
          labels:
            app: example-spring-boot
        spec:
          containers:
            - name: workload
              image: example-spring-boot-k8s
              imagePullPolicy: IfNotPresent
              env:
                - name: CRAC_FILES_DIR
                  value: /var/crac/image
              ports:
                - containerPort: 8080
              volumeMounts:
                - mountPath: /var/crac
                  name: crac-image
          volumes:
            - name: crac-image
              persistentVolumeClaim:
                claimName: crac-image
                readOnly: true
-
This k8s.yaml creates the following resources:
  - A PersistentVolumeClaim representing the storage for the checkpoint image (in Minikube this is bound automatically to a PersistentVolume).
  - A Deployment that starts the application using the restore-or-start.sh script.
  - A Job that creates the checkpoint image.
-
Apply resources with k8s.yaml and observe that this creates two pods:

    $ kubectl apply -f k8s.yaml
    $ kubectl get po
    NAME                                 READY   STATUS    RESTARTS   AGE
    create-checkpoint-fsfs4              2/2     Running   0          4s
    example-spring-boot-68b69cc8-bbxnx   1/1     Running   0          4s
-
Explore the application logs (kubectl logs example-spring-boot-68b69cc8-bbxnx) to check that the application started normally; the checkpoint image has not been created yet. The other pod hosts two containers: one runs checkpoint.sh, and the other warms the application up using siege and then triggers the checkpoint by connecting to port 1111 (this is not a built-in feature; remember that netcat is listening in the background).
-
After a while the job completes:

    $ kubectl get job
    NAME                STATUS     COMPLETIONS   DURATION   AGE
    create-checkpoint   Complete   1/1           19s        44m
-
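Instead of polling kubectl get job by hand, a script can block until the Job finishes. The following is a minimal sketch, assuming the Job name and namespace from k8s.yaml; the function name wait_for_checkpoint_job, the KUBECTL override and the 300s default timeout are illustrative choices and not part of the example.

```shell
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Blocks until the Job reports the Complete condition or the timeout expires.
# Defaults for job name and namespace come from k8s.yaml; the 300s timeout
# is an arbitrary assumption.
wait_for_checkpoint_job() {
  job="${1:-create-checkpoint}"
  namespace="${2:-example}"
  timeout="${3:-300s}"
  $KUBECTL wait --for=condition=complete "job/$job" \
    --namespace "$namespace" --timeout="$timeout"
}
```

Calling `wait_for_checkpoint_job` with no arguments waits for the create-checkpoint Job in the example namespace. Note that a Job that fails never reaches the Complete condition, so in that case the call returns a non-zero status only after the timeout.
-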
Now you can roll out a new deployment, this time restoring the application from the checkpoint image:

    kubectl rollout restart deployment/example-spring-boot
-
After a short moment the application is back up:

    NAME                                   READY   STATUS      RESTARTS   AGE
    create-checkpoint-fsfs4                0/2     Completed   0          95s
    example-spring-boot-79b98966db-ml2pj   1/1     Running     0          15s
-
In the logs you can check that the restore was performed:

    2024-09-30T07:52:11.858Z INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor : Restarting Spring-managed lifecycle beans after JVM restore
    2024-09-30T07:52:11.866Z INFO 129 --- [Attach Listener] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port 8080 (http) with context path ''
    2024-09-30T07:52:11.868Z INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor : Spring-managed lifecycle restart completed (restored JVM running for 45 ms)
-
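The log check can also be scripted. The sketch below classifies log text read from standard input by looking for the restore message shown in the sample output; the function name startup_mode and its two output labels are hypothetical, and the exact message text is assumed to match the Spring version used in the example.

```shell
# Reads log text on stdin and prints "restored" if it contains the message
# Spring logs after a CRaC restore (text taken from the sample log output),
# otherwise prints "regular-start".
startup_mode() {
  if grep -q "Restarting Spring-managed lifecycle beans after JVM restore"; then
    echo "restored"
  else
    echo "regular-start"
  fi
}
```

A possible invocation is `kubectl logs deployment/example-spring-boot | startup_mode`.
-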
Finally, verify that the application responds to requests. You should get the "Greetings from Spring Boot!" reply:

    kubectl expose deployment example-spring-boot --type=NodePort --port=8080
    URL=$(minikube service example-spring-boot -n example --url)
    curl $URL
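Right after a rollout the service may need a moment before it answers, so a scripted check may want to retry. The following is a minimal sketch of such a retry loop; the function name wait_for_reply and the RETRY_DELAY variable are hypothetical, and the expected text is the greeting quoted above.

```shell
# Runs the given command repeatedly until its output equals the expected text
# or the attempts run out.
# Usage: wait_for_reply EXPECTED ATTEMPTS COMMAND [ARGS...]
wait_for_reply() {
  expected="$1"
  attempts="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    reply="$("$@" 2> /dev/null)"
    if [ "$reply" = "$expected" ]; then
      echo "got expected reply after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Delay between attempts in seconds; defaults to 1.
    sleep "${RETRY_DELAY:-1}"
  done
  echo "no expected reply after $attempts attempt(s)" >&2
  return 1
}
```

With the URL variable set as in the step above, this could be called as `wait_for_reply "Greetings from Spring Boot!" 10 curl -s "$URL"`.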