Tips and Tricks for CRaC
This guide assumes you’re familiar with the concepts and API of CRaC, and how a checkpoint can be created and used. Please check out the "Usage Guidelines" first.
Applications and their components often have a simple lifecycle. The application boots, then it is actively used, and in the end it enters shutdown and finally ends up in a terminated state, unable to start again. If the same functionality is needed again, the component is re-created. This allows simpler reasoning and some performance optimizations, such as making fields final or not guarding access to an uninitialized component, since the developer knows that it has not been published yet.
While usually most of the application can stay as-is, some components may need to implement the CRaC methods so that their lifecycle can transition from the active to a suspended state and back. While a component is suspended (after its beforeCheckpoint() has run but before the whole VM is checkpointed, or after a restore but before the component itself has been restored), the rest of the application can still be running and can access the component - e.g. a pool of network connections. At that moment the component is unusable; one solution is to block the calling thread and unblock it when the component is ready to be used again.
The implementation of this synchronization depends mostly on the threading model of the application. In this guide, the synchronized component is referred to as a resource, even though it might not implement the Resource interface directly.
Implementing Resource as Inner Class
In order to encapsulate the functionality, the Resource interface is sometimes not implemented directly by the component but in an (anonymous) inner class. However, it is not sufficient to just pass this resource to the Context.register() method. Contexts usually hold resources through weak references: since there is no unregister method on the Context, a strong reference would prevent the component from ever being garbage-collected once the application releases it. Therefore, the inner class needs to be stored inside the component (in a field) to keep it from being garbage-collected prematurely:
import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;

public class Component {
    private final Resource cracHandler;

    public Component() {
        /* other initialization */
        cracHandler = new Resource() {
            @Override
            public void beforeCheckpoint(Context<? extends Resource> context) {
                /* ... */
            }

            @Override
            public void afterRestore(Context<? extends Resource> context) {
                /* ... */
            }
        };
        /* Using just .register(new Resource() { ... }) here would let
           the resource be garbage-collected almost immediately. */
        Core.getGlobalContext().register(cracHandler);
    }
}
Common CRaC Patterns
Unknown Number of Threads Arriving Randomly
When you don’t have any guarantees about who is calling into the resource, any access can be guarded with a java.util.concurrent.locks.ReadWriteLock. Every regular access to the object acquires the read lock, which does not change the way the object is used since readers don’t block each other, while beforeCheckpoint() acquires the write lock and afterRestore() releases it. The idea of the pattern is therefore to block the checkpoint until all regular accesses to the object have finished.
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import jdk.crac.Context;
import jdk.crac.Resource;

public class ConnectionPool implements Resource {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Lock readLock = lock.readLock();
    private final Lock writeLock = lock.writeLock();

    /* Constructor registers this Resource */

    public Connection getConnection() {
        readLock.lock();
        try {
            /* actual code fetching and returning the connection */
            /* In this example the access to the connection itself
               is not protected and the application must be able
               to handle a closed connection. */
        } finally {
            readLock.unlock();
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        writeLock.lock();
        /* close all connections */
        /* Note: if this method throws an exception CRaC will try
           to restore the resource by calling afterRestore() -
           no need to unlock the lock here */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        try {
            /* initialize connections if needed */
        } finally {
            writeLock.unlock();
        }
    }
}
This solution has the obvious drawback of adding contention on the hot path, the getConnection() method. Even though readers don’t block each other, the read-lock implementation likely has to perform some atomic writes, which aren’t cost-free.
CRaC might eventually provide an optimized version of this read-write locking pattern that moves most of the cost to the write-lock side, since there is no need to optimize for checkpoint performance.
One or Known Number of Periodically Arriving Threads
When there is only a single thread, e.g. one fetching tasks from a queue, or a known number of parties that arrive at the component often enough, a more efficient solution can be used.
The following example shows a resource that logs data to a file, and assumes that the checkpoint notifications are invoked from another thread. This is the case when the checkpoint is triggered through jcmd <pid> JDK.checkpoint. A java.util.concurrent.Phaser is used because it offers a non-interruptible way of waiting.
import java.io.IOException;
import java.util.concurrent.Phaser;

import jdk.crac.Context;
import jdk.crac.Resource;

public class Logger implements Resource {
    private final int N = 1; // number of threads calling write()
    private volatile Phaser phaser;

    public void write(Chunk data) throws IOException {
        checkForCheckpoint();
        /* do the actual write */
    }

    public void checkForCheckpoint() throws IOException {
        Phaser phaser = this.phaser;
        if (phaser != null) {
            if (phaser.arriveAndAwaitAdvance() < 0) {
                throw new IllegalStateException("Shouldn't terminate here");
            }
            /* now the resource is suspended */
            if (phaser.arriveAndAwaitAdvance() < 0) {
                throw new IOException("File could not be opened after restore");
            }
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        phaser = new Phaser(N + 1); // +1 for self
        phaser.arriveAndAwaitAdvance();
        /* close file being written */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        Phaser phaser = this.phaser;
        this.phaser = null;
        try {
            /* reopen the file */
            phaser.arriveAndAwaitAdvance();
        } catch (Exception e) {
            phaser.forceTermination();
            throw e;
        }
    }
}
Only one volatile read is required for synchronization on each write() call, which is generally a cheap operation. However, if one of the expected threads does not reach the synchronization point for a long time (e.g. because it is blocked waiting for work), the checkpoint would be blocked as well. This can be mitigated by using shorter timeouts (e.g. if the thread is polling a queue) or even by actively interrupting the thread from the beforeCheckpoint method.
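The following sketch illustrates both mitigations on top of the Logger example above. It is an assumption-laden fragment, not part of the original example: the writerLoop() method, the writerThread field and the BlockingQueue of pending chunks are hypothetical names introduced only for this illustration, and the shown beforeCheckpoint() is a variant of the one above. The writer polls the queue with a short timeout so it reaches checkForCheckpoint() regularly, and treats an interrupt as a hint to re-check for a pending checkpoint rather than as an error.

    /* Fragment to be merged into the Logger above; additionally requires
       java.util.concurrent.BlockingQueue and java.util.concurrent.TimeUnit. */
    private volatile Thread writerThread; // hypothetical field, set by the writer itself

    public void writerLoop(BlockingQueue<Chunk> queue) throws IOException {
        writerThread = Thread.currentThread();
        while (true) {
            try {
                /* A short timeout guarantees the writer returns to checkForCheckpoint()
                   regularly even when the queue stays empty. */
                Chunk data = queue.poll(100, TimeUnit.MILLISECONDS);
                if (data != null) {
                    write(data); // write() calls checkForCheckpoint() itself
                } else {
                    checkForCheckpoint();
                }
            } catch (InterruptedException e) {
                /* Interrupted from beforeCheckpoint(): re-check for a pending checkpoint. */
                checkForCheckpoint();
            }
        }
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        phaser = new Phaser(N + 1); // +1 for self
        Thread t = writerThread;
        if (t != null) {
            t.interrupt(); // wake the writer from queue.poll() so it can arrive at the phaser
        }
        phaser.arriveAndAwaitAdvance();
        /* close file being written */
    }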
Event Loop Model
In case there is a single thread doing request processing, it’s possible to synchronize its execution with the checkpoint. The idea of this pattern is to schedule a task that will quiesce the processing thread, and to block the checkpoint until that task starts executing. The body of the task should then unblock the checkpoint and keep the processing thread in the quiescent state until the restore is complete.
The following example shows a resource sending a heartbeat message.
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import jdk.crac.Context;
import jdk.crac.Resource;

public class HeartbeatManager implements Runnable, Resource {
    private final ScheduledExecutorService eventloop; // single-threaded
    private boolean suspended;

    /* Constructor also registers this Resource */
    public HeartbeatManager(ScheduledExecutorService eventloop) {
        this.eventloop = eventloop;
        eventloop.scheduleAtFixedRate(this, 0, 1, TimeUnit.MINUTES);
    }

    @Override
    public void run() {
        /* send heartbeat message */
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        synchronized (this) {
            HeartbeatManager self = this;
            eventloop.execute(() -> {
                synchronized (self) {
                    self.suspended = true;
                    self.notify();
                    while (self.suspended) {
                        try {
                            self.wait();
                        } catch (InterruptedException e) {
                            /* not expected on the event-loop thread; keep waiting */
                        }
                    }
                }
            });
            while (!suspended) {
                wait();
            }
        }
        /* shutdown */
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        /* restore */
        synchronized (this) {
            suspended = false;
            notify();
        }
    }
}
Beware that if the single-threaded executor is shared between several components, this solution is not applicable in its current form: one resource would block the executor and the others would not be able to get suspended. In that case it makes sense to centralize the control into one resource per executor, but this is out of the scope of this document.
Also note one detail in the example above: if the application is stopped for a long time, the task scheduled by ScheduledExecutorService.scheduleAtFixedRate(…) tries to keep up after restore and performs all the missed invocations. Handling this behavior should be part of the beforeCheckpoint procedure, cancelling the task there and scheduling it again in afterRestore.
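As a minimal sketch of that handling (not part of the example above), the HeartbeatManager could keep the ScheduledFuture returned by scheduleAtFixedRate() in a new field; the future field below is an assumption introduced only for this sketch. beforeCheckpoint() cancels it after suspending the event loop, and afterRestore() schedules a fresh periodic task:

    /* Fragment to be merged into the HeartbeatManager above;
       additionally requires java.util.concurrent.ScheduledFuture. */
    private ScheduledFuture<?> future; // assigned in the constructor instead of discarding
                                       // the result of scheduleAtFixedRate()

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        /* ... suspend the event-loop thread as shown above ... */
        future.cancel(false); // prevents a burst of missed heartbeats after restore
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        future = eventloop.scheduleAtFixedRate(this, 0, 1, TimeUnit.MINUTES);
        /* ... unblock the event-loop thread as shown above ... */
    }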