What is Coordinated Restore at Checkpoint?
Coordinated Restore at Checkpoint (CRaC) is a JDK project that lets you start Java programs on Linux with a shorter time to first transaction, and with less time and fewer resources needed to reach full code speed. CRaC takes a snapshot of the Java process once it is fully warmed up, then uses that snapshot to launch any number of JVMs from the captured state. Not every existing Java program can run without modification, because all resources must be explicitly checkpointed and restored using the CRaC API. Popular frameworks such as Micronaut and Quarkus support CRaC checkpointing out of the box.
When an application starts running, the JVM looks for methods that are hot spots (hence the name HotSpot for the implementation of the JVM that is now the OpenJDK JVM), and compiles them to get better performance than interpreting the bytecodes. This results in fast, optimized code, but has the downside of the JVM needing both time and compute resources to determine which methods to compile and then compile them. This is what we refer to as the warmup time of an application. The fact that this same work has to happen every time we run an application makes the JVM less attractive in certain situations like microservices and serverless computing.
Azul’s ReadyNow! technology, which is part of Prime JVM, offers a solution where the warmup time can be reduced. ReadyNow! allows a running application to store all the state of the compiled methods, and even the compiled code itself.
The CRaC project is another improvement, driven by Azul. CRaC enables you to store the state of an application itself, to reduce the required time to load data and initialize the required structures. This is achieved by using a Checkpoint/Restore approach.
Currently, CRaC is only available for the specified Linux systems, in versions 17 and 21 of Azul Zulu Builds of OpenJDK. For development, you can use the CRaC Java library on any platform.
Checkpoint/Restore allows a running application to be paused and restarted at some point later in time, potentially on a different machine. One of the overall goals is to support the migration of containers. When performing a checkpoint, essentially, the full context of the process is saved: program counter, registers, stacks, memory-mapped and shared memory and so on. To restore the application, all this data can be reloaded, and (theoretically) it continues from the same point. However, there are some challenges, not least of which are open files, network connections and a sudden change in the value of the system clock.
Since the JVM is just a running application, Checkpoint/Restore can be used to pause and restart the application running on it. The OpenJDK project "Coordinated Restore at Checkpoint (CRaC)" adds APIs to the Java runtime that notify the application when it is about to be checkpointed and when it has been restored. This API is straightforward, and it imposes some restrictions on an application's state at the moment it is checkpointed. The restrictions are quite logical: the application must have no open file descriptors or network connections. Meeting these restrictions dramatically improves the ability to reliably restart an application from a given checkpoint.
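The notification pattern described above can be sketched in plain Java. This is a minimal, hedged illustration rather than the CRaC API itself: `CheckpointAware`, `LifecycleRegistry`, and `LogFileResource` are hypothetical names invented for this sketch so that it runs on any JDK. In the CRaC library, the corresponding pieces are the `org.crac.Resource` interface (with `beforeCheckpoint` and `afterRestore` callbacks that receive a context argument) and registration through `Core.getGlobalContext().register(...)`. The sketch shows a resource that closes its file before the checkpoint, so that no file descriptor is open at that moment, and reopens it after the restore.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the callback pair offered by org.crac.Resource.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical stand-in for the context that notifies registered resources.
final class LifecycleRegistry {
    private final List<CheckpointAware> resources = new ArrayList<>();

    void register(CheckpointAware resource) {
        resources.add(resource);
    }

    // Simulates one checkpoint/restore cycle inside a single process.
    // 'atCheckpoint' runs at the point where a real snapshot would be taken.
    void simulateCheckpointRestore(Runnable atCheckpoint) throws Exception {
        for (CheckpointAware r : resources) {
            r.beforeCheckpoint();
        }
        atCheckpoint.run();
        for (CheckpointAware r : resources) {
            r.afterRestore();
        }
    }
}

// A resource that holds an open file, which must be released before a checkpoint.
final class LogFileResource implements CheckpointAware {
    private final Path path;
    private BufferedWriter writer;

    LogFileResource(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        writer = Files.newBufferedWriter(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    boolean isOpen() {
        return writer != null;
    }

    void log(String line) throws IOException {
        writer.write(line);
        writer.newLine();
        writer.flush();
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        // Release the file descriptor so nothing is open at checkpoint time.
        writer.close();
        writer = null;
    }

    @Override
    public void afterRestore() throws IOException {
        // Reacquire the file once the process is running again.
        open();
    }
}

final class CracLifecycleSketch {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("crac-sketch", ".log");
        LogFileResource log = new LogFileResource(file);
        LifecycleRegistry registry = new LifecycleRegistry();
        registry.register(log);

        log.log("before checkpoint");
        registry.simulateCheckpointRestore(
                () -> System.out.println("open at checkpoint: " + log.isOpen()));
        log.log("after restore");

        System.out.println(Files.readAllLines(file));
    }
}
```

In a CRaC-enabled runtime the checkpoint is not simulated in-process like this. The runtime invokes the registered callbacks around the actual snapshot, and the same close-and-reopen pattern applies to network connections.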
As part of this project, Azul created a proof-of-concept build of JDK 17 to demonstrate the CRaC functionality. The results from these first tests were very promising. For instance, with a sample Spring Boot application in a test environment, the time to the first operation was roughly four seconds. By using a checkpoint of the running, warmed-up application, a restore was able to get to the first operation in 40 ms. That's two orders of magnitude faster!
The chart is based on experiments in the following environment:
Laptop with Intel i7-5500U, 16 GB RAM and SSD.
Linux kernel 5.7.4-arch1-1.
The data was collected in a container running an ubuntu:18.04 based image.
Host operating system: Arch Linux.