
Monitoring Optimizer Hub

You can monitor your Optimizer Hub using one or more of the tools described below.

Using Prometheus and Grafana

The Optimizer Hub components are already configured to expose key metrics for scraping by Prometheus. Follow Configuring Prometheus and Grafana to set up these monitoring tools, and see Using the Grafana Dashboard for more information about the different sections of the dashboard.

Using the REST APIs

The Optimizer Hub REST APIs are mostly intended for operational concerns, but you can use them to get information about how your Optimizer Hub instance is being used. We recommend writing a script that scrapes this API regularly for a historical view of the Optimizer Hub usage over time.

Remember that each app is identified by a profile name, which you can configure on the JVM side with -XX:ProfileName=<name>. Multiple JVM instances that use the same profile name share Optimizer Hub functionality, such as ReadyNow Orchestrator profiles and Cloud Native Compiler caching.
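For example, every instance of the same service can be started with the same profile name (the service name and jar file below are illustrative):

```
java -XX:ProfileName=payment-service -jar payment-service.jar
```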

Some important questions you can answer with the API:

  • Of the currently connected client JVMs, how many are running which applications?

    Use /api/currentlyConnectedProfileNames to retrieve a list of currently connected profile names and check the value of currentlyConnectedVMInstanceCount.

    For example, you can use this approach to find out which clients are stuck in a reboot loop, causing Cloud Native Compiler not to scale down.
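    As a sketch, the per-profile VM counts could be tallied from the endpoint's response entries like this (the field names follow the response described above; the profile names are illustrative):

```python
def vm_counts(profiles):
    """Map each profile name to its currentlyConnectedVMInstanceCount."""
    return {p['name']: p.get('currentlyConnectedVMInstanceCount', 0)
            for p in profiles}

# Example with two hypothetical profiles:
counts = vm_counts([
    {'name': 'payment-service', 'currentlyConnectedVMInstanceCount': 12},
    {'name': 'inventory-service', 'currentlyConnectedVMInstanceCount': 3},
])
print(counts)  # {'payment-service': 12, 'inventory-service': 3}
```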

  • Which application groups are using ReadyNow Orchestrator and which ones are using Cloud Native Compiler?

    Call /api/profileNames repeatedly, following the page tokens, to get the full list of profile names in an Optimizer Hub instance’s memory, then group the entries by their "cncActive" and "rnoActive" values.
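    A sketch of how the accumulated list could then be split, assuming each entry carries boolean "cncActive" and "rnoActive" fields (the profile names below are illustrative):

```python
def split_by_service(profiles):
    """Group profile names by which Optimizer Hub service they use."""
    cnc = [p['name'] for p in profiles if p.get('cncActive')]
    rno = [p['name'] for p in profiles if p.get('rnoActive')]
    return cnc, rno

cnc, rno = split_by_service([
    {'name': 'payment-service', 'cncActive': True, 'rnoActive': False},
    {'name': 'batch-jobs', 'cncActive': False, 'rnoActive': True},
])
print(cnc, rno)  # ['payment-service'] ['batch-jobs']
```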

  • How much compute power is each application using?

    You may need to charge back the cost for your Optimizer Hub environment to the individual teams, each of which may consume a different amount of resources. Since Cloud Native Compiler is the major consumer of resources, you only need to report per-profile name resource consumption for Cloud Native Compiler. Below you can find two example scripts to achieve this.

    Run this first script periodically (e.g., once a minute) to scrape data from /api/currentlyConnectedProfileNames, saving each result to a separate JSON file.

     
    import requests
    import json

    api_base_url = 'http://<your-instance>:8120'

    def get_currently_connected_profiles(page_token=None):
        # Request one page of currently connected profile names.
        params = {'pageToken': page_token} if page_token is not None else {}
        response = requests.get(f'{api_base_url}/api/currentlyConnectedProfileNames', params)
        response.raise_for_status()
        return response.json()

    # Follow the page tokens until the full list has been collected.
    all_profiles = []
    page_token = None
    while True:
        response = get_currently_connected_profiles(page_token)
        all_profiles += response['data']
        page_token = response['nextPage']
        if page_token is None:
            break

    print(json.dumps(all_profiles, indent=2))
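    One possible way to schedule the scrape, assuming the script above is saved as scrape_profiles.py (the paths and schedule are illustrative):

```
# crontab entry: scrape once a minute, writing one timestamped JSON file per run
* * * * * python3 /opt/opthub/scrape_profiles.py > /var/opthub-scrapes/$(date +\%s).json
```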

    Once enough JSON files are generated, you can aggregate them (important: process them in the order they were taken) to produce a table of resource usage per profile name for the period of time covered by those scrapes.

     
    import sys
    import json
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class Accumulator:
        # Accumulates a monotonically increasing counter that may reset
        # (e.g. when the Optimizer Hub instance restarts).
        prev_sample: float = 0
        total: float = 0

        def add(self, sample: float) -> None:
            if sample >= self.prev_sample:
                self.total += sample - self.prev_sample
            else:
                # Counter reset: count the new sample from zero.
                self.total += sample
            self.prev_sample = sample

    def read_scrape(path: str) -> Dict[str, float]:
        # Read one scrape file and return per-profile CNC resource usage.
        profile_usage: Dict[str, float] = {}
        with open(path, 'r') as f:
            data = json.load(f)
        for profile in data:
            name = profile['name']
            usage = (profile['cnc'] or {}).get('resouceUsage', 0)
            profile_usage[name] = usage
        return profile_usage

    def main(scrape_paths) -> None:
        usage_by_profile: Dict[str, Accumulator] = {}
        # Process the scrapes in the order they were taken.
        for path in scrape_paths:
            scrape = read_scrape(path)
            for name, usage in scrape.items():
                usage_by_profile.setdefault(name, Accumulator())
                usage_by_profile[name].add(usage)
        total_usage = sum(acc.total for acc in usage_by_profile.values())
        for name, acc in sorted(usage_by_profile.items(),
                                key=lambda item: item[1].total, reverse=True):
            percentage = 100 * acc.total / total_usage if total_usage else 0
            print(f'{name:<30} {acc.total:<15.0f} {percentage:>6.2f}%')

    if __name__ == "__main__":
        main(sys.argv[1:])

    This lists the detected profiles with their percentage of use:

     
    exampleProfileName1            7500048012       81.08%
    exampleProfileName2            1750024006       18.92%

    You can apply these percentages to the total amount your organization has spent on the hosting infrastructure for Optimizer Hub for the same period to split the costs between the applications.
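The split itself is a simple proportional calculation. A sketch (the total spend and usage figures below are illustrative):

```python
def cost_share(total_spend, usage_by_profile):
    """Split total_spend proportionally to each profile's accumulated usage."""
    total = sum(usage_by_profile.values())
    return {name: total_spend * usage / total
            for name, usage in usage_by_profile.items()}

# Hypothetical monthly hosting spend split across two profiles:
shares = cost_share(10_000.0, {
    'exampleProfileName1': 7_500_048_012,
    'exampleProfileName2': 1_750_024_006,
})
print(shares)
```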

Retrieving Optimizer Hub Logs

All Optimizer Hub components, including third-party ones, log some information to stdout. These logs are very important for diagnosing problems.

You can extract individual logs with the following command:

 
kubectl -n my-opthub logs {pod}

However, by default Kubernetes keeps only the last 10 MB of logs for each container, which means that in a cluster under load important diagnostic information can quickly be overwritten by subsequent logs.
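If you control the node configuration, the retention limit itself can be raised via the kubelet. A sketch of a KubeletConfiguration fragment (the values are illustrative; log aggregation, described below, remains the more robust option):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi   # per-container log size cap before rotation
containerLogMaxFiles: 5      # number of rotated log files to keep
```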

You should configure log aggregation for all Optimizer Hub components, so that logs are moved to persistent storage and can be extracted when an issue needs to be analyzed. You can use any log aggregation tool; one suggested option is Loki. You can query the Loki logs using the logcli tool.

Here are some common commands you can run to retrieve logs:

  • Find out host and port where Loki is listening

     
    export LOKI_ADDR=http://{ip-address}:{port}
  • Get logs of all pods in the selected namespace

     
    logcli query --since 24h --forward --limit=10000 '{namespace="zvm-dev-3606"}'
  • Get logs of a single application in the selected namespace

     
    logcli query --since 24h --forward --limit=10000 '{namespace="zvm-dev-3606",app="compile-broker"}'
  • Get logs of a single pod in the selected namespace

     
    logcli query --since 24h --forward --limit=10000 '{namespace="zvm-dev-3606",pod="compile-broker-5fd956f44f-d5hb2"}'

Extracting Compilation Artifacts

Optimizer Hub uploads compiler engine logs to blob storage. By default, only logs from failed compilations are uploaded.

You can retrieve the logs from your blob storage, which uses the directory structure <compilationId>/<artifactName>. The <compilationId> starts with the VM-Id, which you can find in connected-compiler-%p.log:

 
# Log command-line option
-Xlog:concomp=info:file=connected-compiler-%p.log::filesize=500M:filecount=20

# Example:
[0.647s][info ][concomp] [ConnectedCompiler] received new VM-Id: 4f762530-8389-4ae9-b64a-69b1adacccf2
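Given log lines like the example above, a small sketch for pulling out the VM-Id (the UUID pattern is an assumption based on the example):

```python
import re

def extract_vm_id(log_line):
    """Return the VM-Id from a 'received new VM-Id' log line, or None."""
    m = re.search(r'received new VM-Id:\s*([0-9a-f-]{36})', log_line)
    return m.group(1) if m else None

line = ('[0.647s][info ][concomp] [ConnectedCompiler] '
        'received new VM-Id: 4f762530-8389-4ae9-b64a-69b1adacccf2')
print(extract_vm_id(line))  # 4f762530-8389-4ae9-b64a-69b1adacccf2
```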

Note About gw-proxy Metrics

The gw-proxy component in Optimizer Hub uses, by default, /stats/prometheus as its HTTP endpoint for exposing metrics. Most other Optimizer Hub components use /q/metrics. If you make manual changes to the metrics configuration of individual Kubernetes Deployments in the Optimizer Hub installation, make sure that you don’t use /q/metrics for the gw-proxy deployment. Doing so would produce misleading results when the metrics are processed.
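A Prometheus scrape configuration that respects this difference might look like the following sketch (the job names, ports, and targets are illustrative):

```yaml
scrape_configs:
  - job_name: opthub-components
    metrics_path: /q/metrics          # most Optimizer Hub components
    static_configs:
      - targets: ['compile-broker:8080']   # illustrative target
  - job_name: gw-proxy
    metrics_path: /stats/prometheus   # gw-proxy exposes metrics here instead
    static_configs:
      - targets: ['gw-proxy:8080']         # illustrative target
```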