What is CRaC?
Coordinated Restore at Checkpoint (CRaC) is a JDK project that lets Java programs start with a shorter time to first transaction and reach full code speed with less time and fewer resources. CRaC takes a snapshot of the Java process (a checkpoint) once it is fully warmed up, then uses that snapshot to launch any number of JVMs from the captured state. Not all existing Java programs can run under CRaC without modification: resources such as open files and sockets must be explicitly closed before a checkpoint can be created and reinitialized after the restore, and the CRaC API provides the hooks for doing this. Popular frameworks like Spring, Micronaut, and Quarkus support CRaC checkpointing out of the box.
When an application starts running, the JVM looks for methods that are hot spots (hence the name HotSpot for the JVM implementation that is now the OpenJDK JVM) and compiles them to native code, which runs faster than interpreting their bytecode. This produces fast, optimized code, but the downside is that the JVM needs both time and compute resources to decide which methods to compile and then to compile them. This is what we call the warmup time of an application. Because this same work has to be repeated every time the application runs, the JVM is less attractive in situations like microservices and serverless computing.
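To make the warmup effect more concrete, the following sketch times the same CPU-bound method over several rounds. It is a hypothetical illustration (the class WarmupDemo and its workload are not part of CRaC or the JDK), and the numbers it prints depend on the JVM, its flags, and the hardware. Typically the first round or two are noticeably slower, because they include interpretation and JIT compilation work that later rounds no longer pay for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical warmup demo: times the same workload over several rounds.
// Early rounds usually include interpreter and JIT compilation overhead;
// later rounds typically run already-compiled code. Results vary by JVM and hardware.
class WarmupDemo {

    // A small CPU-bound workload: sum of i * i modulo a prime, for i in [0, n).
    static long workload(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc = (acc + (long) i * i) % 1_000_003L;
        }
        return acc;
    }

    // Runs the workload `rounds` times and returns the elapsed nanoseconds per round.
    static List<Long> measure(int rounds, int n) {
        List<Long> timings = new ArrayList<>();
        long sink = 0; // consume results so the JIT cannot eliminate the work
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            sink += workload(n);
            timings.add(System.nanoTime() - start);
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println("unreachable"); // keeps `sink` observably used
        }
        return timings;
    }

    public static void main(String[] args) {
        List<Long> timings = measure(10, 2_000_000);
        for (int r = 0; r < timings.size(); r++) {
            System.out.printf("round %d: %.2f ms%n", r, timings.get(r) / 1e6);
        }
    }
}
```

A checkpoint taken after the later rounds would, in principle, let a restored JVM skip the slow early rounds entirely; that is the gap CRaC targets.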
With CRaC, you can create a checkpoint with the state of the JVM and its application, and start (restore) as many instances from that checkpoint as required. This reduces the time needed to load data and initialize the required structures on both the JVM side and the application side. In contrast to the more generic Checkpoint/Restore approach implemented in CRIU, CRaC is designed to support multiple restores in different environments from a single checkpoint.
CRaC imposes certain restrictions on the state of the application and JVM to guarantee the consistency and safety of the checkpoint. For example, the checkpoint cannot succeed if the program has an open file handle: if the file referenced by the handle changes, the checkpoint state diverges from the environment, and using the handle after a restore will lead to unpredictable results. For this reason, CRaC requires that there are no open file handles or sockets when you create a checkpoint. This does not mean that only simple applications are compatible with CRaC. Applications can prepare themselves for the checkpoint to satisfy these requirements, and reinitialize themselves after the restore.
CRaC provides new Java APIs to register callbacks (resources) with two methods: beforeCheckpoint() and afterRestore(). As these names imply, you can run arbitrary code on checkpoint and on restore, which lets the application satisfy CRaC's requirements for its state.
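The sketch below applies this pattern to a file handle: a resource closes its file before the checkpoint and reopens it after the restore. To stay self-contained and runnable on any JDK, it declares a hypothetical CheckpointResource interface and a hypothetical ResourceRegistry in place of the real API. In a real application you would instead implement org.crac.Resource, whose beforeCheckpoint and afterRestore methods take a Context<? extends Resource> argument, and register the instance with Core.getGlobalContext().register(...).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.crac.Resource (the real methods take a Context argument).
interface CheckpointResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical stand-in for Core.getGlobalContext(); it simply notifies resources
// in registration order and does not model CRaC's actual ordering rules.
class ResourceRegistry {
    private final List<CheckpointResource> resources = new ArrayList<>();

    void register(CheckpointResource r) {
        resources.add(r);
    }

    void simulateCheckpoint() throws Exception {
        for (CheckpointResource r : resources) r.beforeCheckpoint();
    }

    void simulateRestore() throws Exception {
        for (CheckpointResource r : resources) r.afterRestore();
    }
}

// Hypothetical append-only log that must not hold an open file handle at checkpoint time.
class AppendLog implements CheckpointResource {
    private final Path path;      // kept so the file can be reopened after restore
    private BufferedWriter writer; // null while closed for a checkpoint

    AppendLog(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        writer = Files.newBufferedWriter(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void write(String line) throws IOException {
        if (writer == null) throw new IllegalStateException("log is closed for checkpoint");
        writer.write(line);
        writer.newLine();
    }

    synchronized boolean isOpen() {
        return writer != null;
    }

    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        // Flush and close so no file descriptor remains open in the snapshot.
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }

    @Override
    public synchronized void afterRestore() throws IOException {
        // Reopen by path: the file may differ in the environment where we are restored.
        open();
    }
}
```

The key design choice is that AppendLog keeps the Path rather than relying on the open writer, so it can reopen the file in whatever environment the checkpoint is restored into.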
Azul's ReadyNow, part of Azul Prime Builds of OpenJDK, is another technology that reduces warmup time. ReadyNow allows a running application to store the state of its compiled methods, and even the compiled code itself, without any changes to the application.
As part of this project, Azul created a proof-of-concept build of JDK 17 to demonstrate the CRaC functionality, and the results from these first tests were very promising. For instance, with a sample Spring Boot application in a test environment, it took roughly four seconds before the first operation was processed. Restoring from a checkpoint of the running, warmed-up application reached the first operation in 40 ms. That's two orders of magnitude faster!
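Using the rounded figures quoted above, the speed-up works out as follows:

```latex
\frac{t_{\text{normal start}}}{t_{\text{restore}}} \approx \frac{4\,\text{s}}{40\,\text{ms}}
  = \frac{4000\,\text{ms}}{40\,\text{ms}} = 100 = 10^{2}
```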
The chart is based on experiments run in the following environment:
Laptop with an Intel Core i7-5500U CPU, 16 GB RAM, and an SSD.
Linux kernel 5.7.4-arch1-1.
The data was collected in a container running an ubuntu:18.04-based image.
Host operating system: Arch Linux.