Using CRaC on Kubernetes

A Java application can be run on Kubernetes as a "canary" to create a checkpoint for later runs. This document describes how to perform checkpoint and restore end-to-end inside Kubernetes, using a Minikube cluster, rather than locally or by triggering the checkpoint in a container. The steps below use an example Spring Boot application available on GitHub; adjust them to match your use case.

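  1. Start Minikube, point your shell's Docker client at Minikube's Docker daemon (so that images you build are available to the cluster without a registry), then create the namespace example and make it the default for the current context: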

     
    minikube start
    eval $(minikube docker-env)
    kubectl create ns example
    kubectl config set-context --current --namespace=example
  2. Create a file Dockerfile.k8s with the following content:

     
    # syntax=docker/dockerfile:1.3-labs
    FROM azul/zulu-openjdk:23-jdk-crac-latest AS builder
    RUN apt-get update && apt-get install -y maven
    ADD . /example-spring-boot
    RUN cd /example-spring-boot \
        && mvn -B install \
        && mv target/example-spring-boot-0.0.1-SNAPSHOT.jar /example-spring-boot.jar
    FROM azul/zulu-openjdk:23-jdk-crac-latest
    RUN apt-get update && apt-get install -y ncat
    ENV CRAC_FILES_DIR=/cr
    COPY --from=builder /example-spring-boot.jar /example-spring-boot.jar
    # This script is going to be used in the checkpointing job
    COPY <<'EOF' /checkpoint.sh
    #!/bin/sh
    mkdir -p $CRAC_FILES_DIR
    rm $CRAC_FILES_DIR/* || true
    # After receiving connection on port 1111 trigger the checkpoint
    # (using numeric address to avoid IPv6 problems)
    (nc -v -l -p 1111 && jcmd example-spring-boot.jar JDK.checkpoint) &
    # This can't exec java because the pod would be marked as failed when it exits
    # with exit code 137 after checkpoint
    java -XX:CRaCCheckpointTo=$CRAC_FILES_DIR -XX:CRaCMinPid=128 -jar /example-spring-boot.jar &
    PID=$!
    trap "kill $PID" SIGINT SIGTERM
    wait $PID || true
    EOF
    COPY <<'EOF' /restore-or-start.sh
    #!/bin/sh
    if [ -z "$(ls -A $CRAC_FILES_DIR 2> /dev/null)" ]; then
      echo "No checkpoint found, starting the application normally..."
      exec java -jar /example-spring-boot.jar
    else
      echo "Checkpoint is present, restoring the application..."
      exec java -XX:CRaCRestoreFrom=$CRAC_FILES_DIR
    fi
    EOF
    ENTRYPOINT [ "bash" ]
    CMD [ "/restore-or-start.sh" ]
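
    The comments in checkpoint.sh explain why it does not exec java: the JVM process exits with code 137 after the checkpoint, and the pod would then be marked as failed. By shell convention, 137 is 128 + 9, the status reported for a process terminated by signal 9 (SIGKILL). The bash sketch below reproduces the pattern with a stand-in child process instead of a JVM; the helpers raw_status and run_like_checkpoint_sh are hypothetical and only show how "wait $PID || true" masks the non-zero status.

```shell
#!/usr/bin/env bash
# Sketch (not part of the image): compare a child's raw exit status with the
# status a script gets when it handles the child the way checkpoint.sh does.
# The child is a stand-in for the JVM, not the JVM itself.

# Hypothetical helper: start a child that kills itself with SIGKILL and print
# the status that `wait` reports for it (128 + 9 = 137).
raw_status() {
  sh -c 'kill -9 $$' &
  wait "$!"
  echo "$?"
}

# Hypothetical helper: same child, but awaited with `|| true` as in
# checkpoint.sh, so the function finishes with status 0.
run_like_checkpoint_sh() {
  sh -c 'kill -9 $$' &
  local pid=$!
  wait "$pid" || true
}

echo "raw status: $(raw_status)"
run_like_checkpoint_sh
echo "status after '|| true': $?"
```

    Running it prints a raw status of 137 followed by 0 for the masked variant (the shell may also report "Killed" on stderr).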
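  3. Build the image example-spring-boot-k8s using Dockerfile.k8s, running the command from the root of the example project (the builder stage copies the current directory into the image). The first stage builds the application; the second stage adds the netcat utility and two scripts: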

    • checkpoint.sh starts the application with -XX:CRaCCheckpointTo=… and a netcat server listening on port 1111. When a client connects to this port, the script triggers the checkpoint using jcmd.

    • restore-or-start.sh checks for checkpoint image files and either restores the application from them or falls back to a regular application startup.

       
      docker build -f Dockerfile.k8s -t example-spring-boot-k8s .
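
    The decision in restore-or-start.sh hinges on a single test: the checkpoint directory must exist and contain at least one entry. The bash sketch below isolates that test so it can be tried against a scratch directory; has_checkpoint and start_mode are hypothetical names, and placeholder.img is a dummy file standing in for real checkpoint image files.

```shell
#!/usr/bin/env bash
# Sketch of the check performed by restore-or-start.sh (hypothetical helpers).
# `ls -A` lists all entries except . and ..; on a missing directory it fails,
# and 2> /dev/null hides the error, so the output is empty in both cases.
has_checkpoint() {
  [ -n "$(ls -A "$1" 2> /dev/null)" ]
}

# Print which path restore-or-start.sh would take for a given directory.
start_mode() {
  if has_checkpoint "$1"; then
    echo "restore"   # the script would run: java -XX:CRaCRestoreFrom=...
  else
    echo "start"     # the script would run: java -jar /example-spring-boot.jar
  fi
}

dir=$(mktemp -d)
start_mode "$dir"              # prints "start": the directory is empty
touch "$dir/placeholder.img"   # dummy stand-in for checkpoint image files
start_mode "$dir"              # prints "restore"
rm -rf "$dir"
```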
  4. Create a file k8s.yaml with the following content:

     
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: crac-image
      namespace: example
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 500Mi
      storageClassName: "standard"
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: create-checkpoint
      namespace: example
    spec:
      template:
        spec:
          containers:
            - name: workload
              image: example-spring-boot-k8s
              imagePullPolicy: IfNotPresent
              env:
                - name: CRAC_FILES_DIR
                  value: /var/crac/image
              args:
                - /checkpoint.sh
              securityContext:
                capabilities:
                  add:
                    - CHECKPOINT_RESTORE
                    - SYS_PTRACE
              volumeMounts:
                - mountPath: /var/crac
                  name: crac-image
            - name: warmup
              image: jstarcher/siege
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - |
                  while ! nc -z localhost 8080; do sleep 0.1; done
                  siege -c 1 -r 100000 -b http://localhost:8080
                  echo "Do checkpoint, please" | nc -v localhost 1111
          restartPolicy: Never
          volumes:
            - name: crac-image
              persistentVolumeClaim:
                claimName: crac-image
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-spring-boot
      namespace: example
      labels:
        app: example-spring-boot
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-spring-boot
      template:
        metadata:
          labels:
            app: example-spring-boot
        spec:
          containers:
            - name: workload
              image: example-spring-boot-k8s
              imagePullPolicy: IfNotPresent
              env:
                - name: CRAC_FILES_DIR
                  value: /var/crac/image
              ports:
                - containerPort: 8080
              volumeMounts:
                - mountPath: /var/crac
                  name: crac-image
          volumes:
            - name: crac-image
              persistentVolumeClaim:
                claimName: crac-image
                readOnly: true
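
    In the Job, the warmup container polls port 8080 with nc -z until the application accepts connections, runs siege against it, and only then connects to port 1111 to trigger the checkpoint. That polling loop has no upper bound, so a workload that never starts keeps the Job running. The bash sketch below is a bounded variant of the loop, offered as an assumption rather than part of the example; wait_for_port and its attempt limit are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: bounded version of the warmup loop
#   while ! nc -z localhost 8080; do sleep 0.1; done
# Hypothetical helper; the attempt limit is an assumption, not part of k8s.yaml.
# Arguments: host, port, maximum number of failed probes before giving up.
wait_for_port() {
  local host=$1 port=$2 max_attempts=$3 attempt=0
  until nc -z "$host" "$port"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "port $port on $host did not open after $attempt attempts" >&2
      return 1
    fi
    sleep 0.1
  done
}
```

    For instance, wait_for_port localhost 8080 600 allows roughly a minute of 0.1-second sleeps (plus the time each nc probe takes) before returning a non-zero status.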
  5. This k8s.yaml creates the following resources:

    • PersistentVolumeClaim representing storage for the checkpoint image (in Minikube, this claim is bound automatically to a PersistentVolume)

    • Job that creates the checkpoint image, running checkpoint.sh next to a warmup container

    • Deployment that runs the application using the restore-or-start.sh script (the image's default command) and mounts the claim read-only

  6. Apply k8s.yaml and observe that it creates two pods:

     
    $ kubectl apply -f k8s.yaml
    $ kubectl get po
    NAME                                 READY   STATUS    RESTARTS   AGE
    create-checkpoint-fsfs4              2/2     Running   0          4s
    example-spring-boot-68b69cc8-bbxnx   1/1     Running   0          4s
  7. Check the application logs (kubectl logs example-spring-boot-68b69cc8-bbxnx) to confirm that the application started normally, since the checkpoint image has not been created yet. The other pod hosts two containers: one runs checkpoint.sh, and the other warms up the application using siege and then triggers the checkpoint by connecting to port 1111. This trigger is not a built-in CRaC feature; it is the netcat listener that checkpoint.sh starts in the background.
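
    Because restore-or-start.sh prints a different message on each path, the application log is enough to tell which path a pod took. The bash sketch below classifies log text read from standard input using those two messages; startup_mode is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: classify a pod's startup path from the messages printed by
# restore-or-start.sh. Hypothetical helper; reads log text on standard input.
startup_mode() {
  local logs
  logs=$(cat)
  if grep -q "Checkpoint is present, restoring the application" <<< "$logs"; then
    echo "restored"
  elif grep -q "No checkpoint found, starting the application normally" <<< "$logs"; then
    echo "normal"
  else
    echo "unknown"   # neither message found, e.g. logs from another container
  fi
}
```

    At this stage, kubectl logs example-spring-boot-68b69cc8-bbxnx | startup_mode should print normal, because no checkpoint image exists yet.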

  8. After a while the job completes:

     
    $ kubectl get job
    NAME                STATUS     COMPLETIONS   DURATION   AGE
    create-checkpoint   Complete   1/1           19s        44m
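
    Instead of re-running kubectl get job until the status changes, you can block until the Job reports the complete condition with kubectl wait. The bash sketch below wraps that command; the function name and the default timeout of 300s are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: wait for the create-checkpoint Job defined in k8s.yaml to complete.
# Optional first argument: timeout (hypothetical default: 300s).
# kubectl wait exits non-zero if the condition is not met before the timeout.
wait_for_checkpoint_job() {
  local timeout=${1:-300s}
  kubectl wait --namespace example \
    --for=condition=complete job/create-checkpoint \
    --timeout="$timeout"
}
```

    Calling wait_for_checkpoint_job 120s returns 0 once the Job has completed, or a non-zero status if the timeout expires first.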
  9. Now you can roll out a new deployment, this time restoring the application from the checkpoint image:

     
    kubectl rollout restart deployment/example-spring-boot
  10. After a short moment, kubectl get po shows that the application is back up:

     
    NAME                                   READY   STATUS      RESTARTS   AGE
    create-checkpoint-fsfs4                0/2     Completed   0          95s
    example-spring-boot-79b98966db-ml2pj   1/1     Running     0          15s
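
    Rather than waiting an unspecified moment, kubectl rollout status blocks until the rollout of the restarted Deployment finishes. A minimal bash wrapper follows; the function name and the default timeout of 120s are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: wait for the example-spring-boot rollout to finish.
# Optional first argument: timeout (hypothetical default: 120s).
# kubectl rollout status exits non-zero if the rollout does not finish in time.
wait_for_rollout() {
  kubectl rollout status deployment/example-spring-boot \
    --namespace example --timeout="${1:-120s}"
}
```

    For example: kubectl rollout restart deployment/example-spring-boot && wait_for_rollout 60s.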
  11. In the logs of the new pod, you can confirm that the application was restored from the checkpoint:

     
    2024-09-30T07:52:11.858Z INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor : Restarting Spring-managed lifecycle beans after JVM restore
    2024-09-30T07:52:11.866Z INFO 129 --- [Attach Listener] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port 8080 (http) with context path ''
    2024-09-30T07:52:11.868Z INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor : Spring-managed lifecycle restart completed (restored JVM running for 45 ms)
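
    The last log line reports how long the restored JVM had been running when the Spring lifecycle restart completed. To extract that figure from a log stream, for example to compare several restores, a small sed filter is enough; restore_time_ms is a hypothetical helper, and the sample line is copied from the log output in this step.

```shell
#!/usr/bin/env bash
# Sketch: print N from "restored JVM running for N ms"; prints nothing when
# the log does not contain that message. Reads log text on standard input.
restore_time_ms() {
  sed -n 's/.*restored JVM running for \([0-9][0-9]*\) ms.*/\1/p'
}

# Sample line copied from the Spring Boot log shown in this step.
sample='2024-09-30T07:52:11.868Z INFO 129 --- [Attach Listener] o.s.c.support.DefaultLifecycleProcessor : Spring-managed lifecycle restart completed (restored JVM running for 45 ms)'
echo "$sample" | restore_time_ms   # prints 45
```

    Against the cluster, the same filter can be applied as kubectl logs deployment/example-spring-boot | restore_time_ms.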
  12. Finally, verify that the application responds to requests. You should get the "Greetings from Spring Boot!" reply:

     
    kubectl expose deployment example-spring-boot --type=NodePort --port=8080
    URL=$(minikube service example-spring-boot -n example --url)
    curl $URL
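
    To make this final check scriptable, for example at the end of a CI job, the bash sketch below fetches the URL and succeeds only if the reply contains the expected greeting; check_greeting is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the application at the given URL replies with the
# greeting. curl -f turns HTTP errors (status >= 400) into a non-zero exit;
# -sS hides the progress meter but still prints errors.
check_greeting() {
  local url=$1 body
  body=$(curl -fsS "$url") || return 1
  grep -qF 'Greetings from Spring Boot!' <<< "$body"
}
```

    Usage: check_greeting "$URL" && echo "application is up".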