Tips and Tricks for CRaC
This guide assumes you’re familiar with the concepts and API of CRaC, and how a checkpoint can be created and used. Please check out the "Usage Guidelines" first.
Restoring With New Arguments
In most cases, you do not need to provide arguments to the JVM when restoring because the restored execution will already be based on the arguments specified before checkpoint. For example:
# Start a JVM to be checkpointed:
java -XX:CRaCCheckpointTo=crac-image MyMainClass arg1 arg2
# Restore from a checkpoint; the old arguments are "baked in"
# so we don't need to specify them again:
java -XX:CRaCRestoreFrom=crac-image
However, CRaC allows you to provide a new main class and a new set of arguments for it, in case you wish to alter the execution after restore:
# Restore from a checkpoint with new arguments
java -XX:CRaCRestoreFrom=crac-image MyNewMainClass newArg1 newArg2
When you restore the application with a new main class and new arguments, after all CRaC resources have been successfully processed, the thread which initiated the checkpoint finds the static void main(String[]) method in MyNewMainClass and invokes it with the provided set of arguments (which can be empty) on top of its existing call stack. After this new main method completes, the thread proceeds with executing the code placed after the checkpoint call.
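As a sketch of that flow, assuming the checkpoint is requested programmatically via jdk.crac.Core.checkpointRestore() (use the org.crac package instead when coding against the compatibility library), the two classes below are purely illustrative:

// MyMainClass.java
import jdk.crac.Core;

public class MyMainClass {
    public static void main(String[] args) throws Exception {
        /* ... warm-up work using arg1, arg2 ... */
        Core.checkpointRestore();   // the checkpoint image is created here
        // On a restore that specifies MyNewMainClass, this line runs only
        // after MyNewMainClass.main() has returned.
        System.out.println("back in the original main");
    }
}

// MyNewMainClass.java -- must already be on the class path when the checkpoint is created
public class MyNewMainClass {
    public static void main(String[] args) {
        System.out.println("restored with " + args.length + " new argument(s)");
    }
}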
Note: Since you can’t extend the class path on restore, the new main class is required to be available on the class path defined before the checkpoint is created. It is also not possible to use the -jar app.jar syntax on restore to launch the main class from a JAR file. These limitations may be lifted in the future.
Note: When you create a checkpoint via jcmd, the single thread that executes all diagnostic commands will be the one executing the new main. As a result, new commands will be blocked from being executed until the new main completes. Thus, when using new arguments, it is recommended to either initiate the checkpoint on an application thread or start a new thread from the new main.
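For instance, a hypothetical new main class following that recommendation could hand the work off to a fresh thread and return immediately, so the jcmd diagnostic thread is not tied up:

// Hypothetical new main class used on restore after a jcmd-triggered checkpoint.
// It returns quickly so the diagnostic-command thread is not blocked.
public class RestoreEntryPoint {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            // long-running work that should not occupy the jcmd thread
        }, "restore-worker");
        worker.start();
        // main() returns immediately; the jcmd thread continues with the code
        // placed after the checkpoint call
    }
}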
Example: Using New Arguments on Restore for CRaC-ing javac
The Zulu CRaC distributions contain a demo of a CRaC-ed javac based on the restore arguments feature in <java-dir>/demo/crac/JavaCompilerCRaC. There, you can find an executable JAR and the source code to understand how this works.
This is how the CRaC-ed javac can be used:
- Implement some Java test classes to be compiled, suppose they are MyClass1.java and MyClass2.java.

- Warm up the CRaC-ed javac by compiling MyClass1.java and checkpoint it:

  java -XX:CRaCCheckpointTo=javac-image \
    -jar <java-dir>/demo/crac/JavaCompilerCRaC/JavaCompilerCRaC.jar \
    MyClass1.java

  At this point, the default main class of JavaCompilerCRaC.jar is invoked with MyClass1.java as an argument. The default main class invokes the standard javac to compile MyClass1.java and then initiates a checkpoint.

- Restore and use the warmed-up CRaC-ed javac to compile MyClass2.java:

  java -XX:CRaCRestoreFrom=javac-image \
    Compile MyClass2.java

  - At this point, we specify a class named Compile, contained in JavaCompilerCRaC.jar, as the new main class, and MyClass2.java as an argument for its main method.
  - Compile invokes the standard javac, which was warmed up before the checkpoint, to compile MyClass2.java, and completes.
  - The execution then returns to the default main class of JavaCompilerCRaC.jar, which also completes, and thus the restored JVM exits.
Using Different CPU Counts and Memory Sizes During Checkpoint and Restore
In the current implementation, the number of internal VM threads, the Java heap size, and other internal data structure sizes are fixed according to the size of the container/system at the time of the checkpoint. They do not automatically adjust to the new configuration during restore.
Having more CPUs or more memory during restore should not cause any stability issues. Having fewer CPUs or less memory during restore may result in the JVM being over-sized for the new environment, potentially leading to stability or performance issues.
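If it helps operations, a resource can at least record the checkpoint-time sizing and report it on restore, so a mismatch with the new environment is easy to spot. The following is only an illustrative sketch, not something CRaC does for you:

import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;

// Illustrative resource: remember what the JVM was sized for at checkpoint time
// and log it on restore so a mismatch with the new container is easy to spot.
public class SizingReport implements Resource {
    private int checkpointCpus;
    private long checkpointMaxHeap;

    public SizingReport() {
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        checkpointCpus = Runtime.getRuntime().availableProcessors();
        checkpointMaxHeap = Runtime.getRuntime().maxMemory();
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        System.out.println("JVM was sized at checkpoint for " + checkpointCpus
                + " CPUs and a max heap of " + checkpointMaxHeap + " bytes;"
                + " compare this with the restore environment");
    }
}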
Application Lifecycle with CRaC
Applications and their components often have a simple lifecycle. The application boots, then it is actively used, and in the end it enters shutdown and finally ends up in a terminated state, unable to be started again. If the same functionality is needed again, the component is re-created. This allows simpler reasoning and some performance optimizations, such as making fields final or not protecting access to an uninitialized component, because the developer knows that it has not been published yet.
While most of the application can usually stay as-is, some components need to extend their lifecycle by implementing the CRaC methods to transition from an active to a suspended state and back. While the component is suspended, before the whole VM is either terminated or restored, the rest of the application can still be running and may access the component - e.g. a pool of network connections - even though the component is unusable at that moment. One solution is to block the accessing thread and unblock it when the component is ready to be used again.
The implementation of this synchronization depends mostly on the threading model of the application. In this guide, the synchronized component is referred to as a resource, even though it might not implement the Resource interface directly.
Implementing Resource as Inner Class
In order to encapsulate the functionality, the Resource interface is sometimes not implemented directly by the component, but in an (anonymous) inner class. However, it is not sufficient to just pass this resource to the Context.register() method: contexts usually hold resources using weak references, because there is no unregister method on the Context and a strong reference would prevent the component from being garbage-collected when the application releases it. Therefore, the inner class needs to be stored inside the component (in a field) to prevent it from being garbage-collected prematurely:
import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;

public class Component {
    private final Resource cracHandler;

    public Component() {
        /* other initialization */
        cracHandler = new Resource() {
            @Override
            public void beforeCheckpoint(Context<? extends Resource> context) {
                /* ... */
            }

            @Override
            public void afterRestore(Context<? extends Resource> context) {
                /* ... */
            }
        };
        /* When using just .register(new Resource() { ... }) here,
           the resource would be immediately garbage-collected. */
        Core.getGlobalContext().register(cracHandler);
    }
}
Common CRaC Patterns
Unknown Number of Threads Arriving Randomly
When you don’t have any guarantees about who is calling into the resource, any access can be blocked with a java.util.concurrent.locks.ReadWriteLock. Every regular access to the object is guarded by the read lock, introducing no changes to the way the object is used, while beforeCheckpoint() acquires the write lock and afterRestore() releases it. The idea of the pattern is therefore to block the checkpoint until all regular accesses to the object have finished.
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import jdk.crac.Context;
import jdk.crac.Resource;

public class ConnectionPool implements Resource {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Lock readLock = lock.readLock();
    private final Lock writeLock = lock.writeLock();

    /* Constructor registers this Resource */

    // Connection is the application's connection type
    public Connection getConnection() {
        readLock.lock();
        try {
            /* actual code fetching the connection */
            /* In this example the access to the connection itself
               is not protected and the application must be able
               to handle a closed connection. */
            return null; // placeholder for the fetched connection
        } finally {
            readLock.unlock();
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        writeLock.lock();
        /* close all connections */
        /* Note: if this method throws an exception, CRaC will try
           to restore the resource by calling afterRestore() -
           no need to unlock the lock here */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        try {
            /* initialize connections if needed */
        } finally {
            writeLock.unlock();
        }
    }
}
This solution has the obvious drawback of adding contention on the hot path, the getConnection() method. Even though the readers don’t block each other, the implementation of read locking likely has to perform some atomic writes, which aren’t cost-free.
CRaC might eventually provide an optimized version of this read-write locking pattern that moves most of the cost to the write lock, since checkpoint performance does not need to be optimized.
One or Known Number of Periodically Arriving Threads
When there is only a single thread, e.g. one fetching tasks from a queue, or a known number of parties that arrive at the component often enough, a more efficient solution can be used.
The following example shows the use of a resource that logs data to a file, and assumes that the checkpoint notifications are invoked from another thread. This is the case when the checkpoint creation is triggered through jcmd <pid> JDK.checkpoint. java.util.concurrent.Phaser is used because it has a non-interruptible version of waiting.
import java.io.IOException;
import java.util.concurrent.Phaser;

import jdk.crac.Context;
import jdk.crac.Resource;

public class Logger implements Resource {
    private final int N = 1; // number of threads calling write()
    private volatile Phaser phaser;

    // Chunk is the application's data type
    public void write(Chunk data) throws IOException {
        checkForCheckpoint();
        /* do the actual write */
    }

    public void checkForCheckpoint() throws IOException {
        Phaser phaser = this.phaser;
        if (phaser != null) {
            if (phaser.arriveAndAwaitAdvance() < 0) {
                throw new IllegalStateException("Shouldn't terminate here");
            }
            /* now the resource is suspended */
            if (phaser.arriveAndAwaitAdvance() < 0) {
                throw new IOException("File could not be opened after restore");
            }
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        phaser = new Phaser(N + 1); // +1 for self
        phaser.arriveAndAwaitAdvance();
        /* close the file being written */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        Phaser phaser = this.phaser;
        this.phaser = null;
        try {
            /* reopen the file */
            phaser.arriveAndAwaitAdvance();
        } catch (Exception e) {
            phaser.forceTermination();
            throw e;
        }
    }
}
Only one volatile read is required for synchronization on each write() call, which is generally a cheap operation. However, if one of the expected threads is waiting somewhere else for a long time and therefore does not reach checkForCheckpoint(), the checkpoint would be blocked. This can be mitigated by using shorter timeouts (e.g. if the thread is polling a queue) or even by actively interrupting it from the beforeCheckpoint method.
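As an illustration of the interruption approach, the sketch below assumes the writer thread spends most of its time polling a queue; the class and field names are made up for this example:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import jdk.crac.Context;
import jdk.crac.Resource;

// Illustrative worker: it polls with a short timeout so it reaches the
// checkpoint synchronization point often, and it can additionally be
// interrupted from beforeCheckpoint() if it is blocked in poll().
public class QueueWriter implements Runnable, Resource {
    private final BlockingQueue<Runnable> queue;
    private volatile Thread workerThread;
    private volatile boolean running = true;

    public QueueWriter(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        workerThread = Thread.currentThread();
        while (running) {
            try {
                Runnable task = queue.poll(100, TimeUnit.MILLISECONDS);
                if (task != null) {
                    task.run();
                }
            } catch (InterruptedException e) {
                // woken up by beforeCheckpoint(); just fall through
            }
            // a real implementation would call a checkForCheckpoint()-style
            // method here, as in the Logger example above
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        Thread t = workerThread;
        if (t != null) {
            t.interrupt(); // nudge the worker out of a long wait in poll()
        }
        /* then block until the worker reaches its synchronization point */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        /* release the worker, as in the Logger example above */
    }
}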
Event Loop Model
In case there is a single thread doing request processing, it’s possible to synchronize its execution with the checkpoint. The idea of this pattern is to schedule, on the event loop, a task that quiesces the processing thread, and to block the checkpoint until that task starts executing. The body of the task should unblock the checkpoint and keep the processing thread in the quiescent state until the restore is complete.
The following example shows a resource sending a heartbeat message.
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import jdk.crac.Context;
import jdk.crac.Resource;

public class HeartbeatManager implements Runnable, Resource {
    public final ScheduledExecutorService eventloop; // single-threaded
    public boolean suspended;

    public HeartbeatManager(ScheduledExecutorService eventloop) {
        this.eventloop = eventloop;
        eventloop.scheduleAtFixedRate(this, 0, 1, TimeUnit.MINUTES);
    }

    @Override
    public void run() {
        /* send heartbeat message */
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        synchronized (this) {
            HeartbeatManager self = this;
            eventloop.execute(() -> {
                synchronized (self) {
                    self.suspended = true;
                    self.notify();
                    while (self.suspended) {
                        try {
                            self.wait();
                        } catch (InterruptedException e) {
                            // keep waiting until the restore completes
                        }
                    }
                }
            });
            while (!suspended) {
                wait();
            }
        }
        /* shutdown */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        /* restore */
        synchronized (this) {
            suspended = false;
            notify();
        }
    }
}
Beware that if the single-threaded executor is shared between several components, this solution is not applicable in its current form, as one resource would block the executor and the others would not be able to get suspended. In this case, it makes sense to centralize the control into one resource per executor, but this is out of the scope of this document.
Also note one detail in the example above: if the application is stopped for a long time, the task scheduled by ScheduledExecutorService.scheduleAtFixedRate(…) tries to keep up after restore and performs all the missed invocations. Handling this behavior must be part of the beforeCheckpoint procedure: cancel the task there and reschedule it in afterRestore.
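A minimal sketch of that cancel-and-reschedule approach, loosely based on the HeartbeatManager above; keeping the ScheduledFuture in a field is an assumption of this sketch rather than part of the original example:

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import jdk.crac.Context;
import jdk.crac.Resource;

// Sketch: a heartbeat-like resource that cancels its periodic task before
// the checkpoint and schedules a fresh one after restore, so missed
// invocations are not replayed.
public class RescheduledHeartbeat implements Runnable, Resource {
    private final ScheduledExecutorService eventloop;
    private ScheduledFuture<?> task;

    public RescheduledHeartbeat(ScheduledExecutorService eventloop) {
        this.eventloop = eventloop;
        schedule();
    }

    private void schedule() {
        task = eventloop.scheduleAtFixedRate(this, 0, 1, TimeUnit.MINUTES);
    }

    @Override
    public void run() {
        /* send heartbeat message */
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        /* quiesce the event loop as in the example above, then: */
        task.cancel(false); // don't interrupt a run already in progress
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        schedule(); // start a fresh periodic task at the normal rate
        /* then release the event loop as in the example above */
    }
}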