Troubleshooting Cloud Native Compiler
This page shows how to troubleshoot a misbehaving Cloud Native Compiler (CNC) Service and any Azul Zulu Prime Builds of OpenJDK (Azul Zulu Prime JVM) instances using the CNC Service.
Client VM Troubleshooting
My application running in a CNC-enabled VM shows worse performance than usual. What can I do?
- Double-check the VM arguments. Ensure that the VM is started with the -XX:CNCHost= parameter pointing to the address of the CNC Service gateway. See Connecting a JVM to a Cloud Native Compiler for more details on CNC Service-related VM parameters, and Installing Cloud Native Compiler for how to find the gateway address.
- Enable the CNC Service log in the VM using the -Xlog:concomp parameter and look for log messages that show the JVM connecting to and disconnecting from the CNC Service.
  - If the log says that the VM fails to connect to the service, check that the service is up and running, check the network connectivity between the JVM and the service, and check the value of -XX:CNCHost=.
  - If the log says that the VM disconnects from the service soon after connecting, the log should also give the reason for disconnecting. The most frequent reason for such disconnects is a missing Compiler Engine on the service, indicated by the FAILED_PRECONDITION error code and the message "Compiler engine … not found". See Registering a New Compiler Engine for more information.
  - If the connection between the VM and the service is established and does not break, proceed to the next step (checking the top-tier compilation statistics).
- Collect the VM GC log, open it in GCLA, and look at the top-tier compilation statistics. Top-tier compilation stats can also be seen in the VM compilation log (-XX:+PrintCompilation).
  - If the stats show that no top-tier compilations started, check the value of the -XX:CNCMaxConcurrentCompiles VM parameter.
  - If the stats show a high top-tier compilation failure ratio, it is time to troubleshoot the CNC Service.
- Write down the VM ID seen in the VM concomp log. It can be used to filter service events related to this particular VM.
- Proceed to CNC Server troubleshooting.
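The first two checks in this list can be partly automated. The following is a minimal Python sketch, not an Azul tool: it inspects a JVM argument list for the -XX:CNCHost= and -Xlog:concomp flags named above, and scans log lines for the FAILED_PRECONDITION / "not found" combination. The function names, the example host name, and the sample log line are hypothetical; the real concomp log layout may differ.

```python
from typing import List


def check_cnc_flags(jvm_args: List[str]) -> List[str]:
    """Return warnings about CNC-related flags in a JVM argument list."""
    warnings = []
    # Collect the value of every -XX:CNCHost= occurrence; the last one is checked.
    hosts = [a.split("=", 1)[1] for a in jvm_args if a.startswith("-XX:CNCHost=")]
    if not hosts:
        warnings.append("-XX:CNCHost= is missing")
    elif not hosts[-1]:
        warnings.append("-XX:CNCHost= has an empty value")
    # Accept any -Xlog: selector that mentions the concomp tag.
    if not any(a.startswith("-Xlog:") and "concomp" in a for a in jvm_args):
        warnings.append("-Xlog:concomp is not enabled")
    return warnings


def find_missing_engine_lines(log_lines: List[str]) -> List[str]:
    """Return log lines that look like a missing-Compiler-Engine disconnect."""
    return [
        line for line in log_lines
        if "FAILED_PRECONDITION" in line and "not found" in line
    ]


if __name__ == "__main__":
    # Hypothetical arguments and log line, for illustration only.
    args = ["-Xmx4g", "-XX:CNCHost=cnc.example.internal"]
    for warning in check_cnc_flags(args):
        print("WARN:", warning)
    sample = ["disconnect: FAILED_PRECONDITION Compiler engine X not found"]
    print(find_missing_engine_lines(sample))
```

A check like this only confirms that the flags are present; it does not prove that the address is reachable or correct.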
I see occasional "compiler timeout" errors in service logs and/or the Grafana dashboard. What’s that?
Every compilation on CNC service has a time limit. By default it’s 500 seconds.
- If that limit is exceeded, the first thing to check is the network latency between the VM and the CNC Service using ping <cnc_host>. Latency should not exceed single-digit milliseconds. If the latency is higher, the CNC Service won’t deliver its best performance. Make sure to locate VMs close enough to the CNC Service.
- In rare cases, very large compilations genuinely take that long. If that is the case, the compilation timeout can be changed by adding the -Dcompiler.timeout=N flag to the compile-broker, where N is the timeout in seconds.
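Where ping is blocked or unavailable, a rough latency estimate can be taken from TCP connection setup time instead. The sketch below is an assumption-laden alternative, not the documented procedure: it times a TCP connect to a host and port that you supply (the gateway port is not stated on this page) and compares the median against a 10 ms budget, which is one reading of "single-digit milliseconds". Connect time includes more than one network round trip's worth of overhead on some stacks, so treat the result as an upper bound.

```python
import socket
import statistics
import time
from typing import List


def measure_connect_ms(host: str, port: int, timeout_s: float = 2.0) -> float:
    """Time one TCP connection setup to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass  # Connection established; close it immediately.
    return (time.perf_counter() - start) * 1000.0


def latency_within_budget(samples_ms: List[float], budget_ms: float = 10.0) -> bool:
    """Return True if the median sample is below the latency budget."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    return statistics.median(samples_ms) < budget_ms


if __name__ == "__main__":
    import sys

    # Usage: python latency_check.py <cnc_host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    samples = [measure_connect_ms(host, port) for _ in range(5)]
    print("samples (ms):", [round(s, 2) for s in samples])
    print("within budget:", latency_within_budget(samples))
```

The median is used rather than the mean so that a single slow sample does not dominate the verdict.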
My application running in a CNC-enabled VM behaves incorrectly or crashes. What can I do?
- Collect all VM logs and the hs_err* file and send them to Azul for analysis.
- Run the application without the -XX:CNCHost flag to verify that the problem is specific to connecting to the CNC Service.
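Gathering the files for the first step can be scripted. The following Python sketch bundles every hs_err* file and every *.log file from one directory into a zip archive. It assumes that the VM logs live in a single directory and end in .log, which depends on how your -Xlog output is configured; the function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import List


def collect_crash_artifacts(log_dir: str, out_zip: str) -> List[str]:
    """Zip hs_err* files and *.log files from log_dir; return the names added."""
    directory = Path(log_dir)
    # Build the file list before creating the archive, so the archive
    # itself is never picked up if it is written into the same directory.
    files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and (p.name.startswith("hs_err") or p.suffix == ".log")
    )
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [p.name for p in files]


if __name__ == "__main__":
    import sys

    # Usage: python collect_artifacts.py <log_dir> <out.zip>
    added = collect_crash_artifacts(sys.argv[1], sys.argv[2])
    print("archived:", added)
```

Review the archive before sending it, since application logs can contain sensitive data.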
I sometimes see entries about failed compilations because of "ConnectedCompiler is not yet ready", but I see it is compiling fine. Is that ok?
This may happen when running with SSL enabled. The VM keeps an open connection to the service, but the connection can occasionally be reset and re-established. With SSL, the VM and the service need to perform a handshake to make sure the connection is trusted. The handshake is very quick, but the VM may try to send a compilation request within this small window. This is harmless, as the compilation is resubmitted a moment later.
Cloud Native Compiler Server Troubleshooting
JVM compilation log shows that top-tier compilations are started, but never finished. What can I do?
This can be caused by one of these reasons:
- No compile-broker pods are running in the CNC cluster. Make sure that at least one compile-broker is up and running.
- The CNC Service has too many compilation requests enqueued because too many VMs are connected, and it takes too long to provide compiled code. To confirm, check the length of the pending-compilations queue in the broker. If the queue is long, increase the number of compile-broker replicas.
- The broker pod was recreated in the middle of compilation requests, causing the CNC Service to lose information about compilations requested by VMs. The only way to recover from this is to restart the connected VMs.
I see occasional "vm unreachable" errors in service logs and/or the Grafana dashboard. What’s that?
This is caused by the service’s inability to receive some information necessary for the compilation from the JVM. It usually happens when the JVM disconnects from the service for any reason, e.g. JVM termination or a network error. It’s harmless. The service just skips the compilation and proceeds to the next one.
The compile brokers seem to be compiling, but I see the compilation queue rising to very high numbers over time. Also, some VMs do not seem to be running well. What is going on?
This may be a symptom of the CNC infrastructure being overloaded, which may or may not be only a short spike depending on how many VMs you have and what they are doing. It will be necessary to do a little analysis and tweak CNC service or VM settings.
- The queue never stops rising: your CNC resources are set too low and the service cannot keep up with all your compilation requests. Try adding more resources to the compile brokers.
- The queue grows to a very high number, then slowly goes down over time: when the CNC infrastructure is too busy, it may not respond to certain VMs immediately. VMs have a timeout after which they try to re-submit their compilation requests. This growth may be caused by several VMs repeatedly re-sending their requests. To alleviate this, you can limit the number of concurrent compilation requests sent by VMs. By default, this number is 50. It can be tuned for all VMs by setting the compilation.limit.per.vm attribute for the gateway service, or per VM by using the -XX:CNCMaxConcurrentCompiles flag.
- Did a large number of your VMs start or restart at the same time? It may be that the idle CNC setup was sized too low and is not scaling up fast enough. Consider allocating more resources in its base idle state.
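To tell the first two patterns apart, it can help to export the queue length at regular intervals and look at its shape. The Python sketch below is a simple heuristic, not part of CNC: it labels a series as "rising" when the latest sample is still within a tolerance of the peak, and as "draining" when the queue has clearly fallen back from its peak. The function name, labels, and 10% tolerance are arbitrary assumptions.

```python
from typing import Sequence


def classify_queue_trend(samples: Sequence[int], tolerance: float = 0.1) -> str:
    """Classify queue-length samples (oldest first) as idle, flat, rising, or draining."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to judge a trend")
    peak = max(samples)
    first, last = samples[0], samples[-1]
    if peak == 0:
        return "idle"
    if last >= peak * (1.0 - tolerance):
        # The latest sample is at or near the maximum seen so far.
        return "rising" if last > first else "flat"
    # The queue has dropped noticeably below its earlier peak.
    return "draining"


if __name__ == "__main__":
    print(classify_queue_trend([0, 10, 50, 120]))   # still climbing
    print(classify_queue_trend([0, 200, 150, 90]))  # spike that is clearing
```

A "rising" result over a long window points to the under-resourced case; a "draining" result points to the retry spike, where lowering the per-VM compilation limit is the documented remedy.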