Mar 13, 2023

OpenTelemetry Metrics Primer for Java Developers

Asaf Mesika
Principal Engineer, StreamNative

I spent the last few months learning about OpenTelemetry and its Java SDK while researching how to integrate it into Apache Pulsar as part of my work at StreamNative, which provides flexible Pulsar-as-a-Service that can run in the cloud. If you don’t know Pulsar, you should: it’s a game-changing technology.

OpenTelemetry is a project that is gaining traction these days. Understanding what it is, what it offers, and how it works takes a substantial amount of time (days), even with the help of the articles and videos a Google search turns up. In this blog post, I’ll summarize the key information to save you much of that time.

Super short intro to OpenTelemetry

Before diving into the Metrics part of OpenTelemetry, we need a basic understanding of the project.

OpenTelemetry’s goal is to provide a complete telemetry solution for applications. Telemetry means Metrics, Traces, and Logs. Complete means:

  • Defining an API, meaning a library containing interfaces for you to use to define metrics, report their values, define loggers, report logs, and define traces and report spans for them.
  • Creating an implementation of those APIs, called the SDK, which also contains additional functionality for manipulating the telemetry and exporting it in various formats.
  • Creating an efficient protocol for relaying this telemetry data. The protocol here mainly means schema for the data (i.e., Protobuf schema), its encoding (Protobuf), and the protocol to use to carry it on the wire (gRPC or HTTP).
  • A Telemetry Collector, a lightweight process written in Go, which lets you configure multiple ways to receive the data (protocols, push/pull), transform it, and then send it to various destinations. The destinations include open-source formats and databases as well as proprietary vendors. You can extend it by writing a plugin for any of the three stages: source, transform, or sink (receivers, processors, and exporters in Collector terminology). Most likely you won’t need to, since there are so many community contributions already. You can bundle the plugins you need yourself or use a binary distribution (primarily a Docker image) from a specific vendor that contains their plugins.

The novelty of OpenTelemetry (OTel for short) is that its authors wanted it to look the same in every language, so they created specifications for both the API and the SDK. If you understand the basic entities of the SDK and the API in one language, switching to another language and its respective SDK should feel almost the same.

Their end goal is that every library will use the OTel API. Today, library owners have two ways to expose their metrics to your application:

  1. Write an extension for each metric framework (Dropwizard, Prometheus Client, Micrometer, etc.) that exposes the library’s metrics to it. Application developers using the library then add the extension that matches their metrics framework.
  2. Since not everybody uses the popular metrics frameworks, create a bespoke interface for supplying the metrics (there are no standards for this yet), which application developers implement to connect the library to their custom metrics framework.

OTel aims to be the single interface through which a library reports its metrics, and its logs and traces as well. In Java, logging already feels like that today thanks to SLF4J, as most libraries use it and most logging frameworks support a bridge from SLF4J to them. The key difference in OpenTelemetry is that they don’t want to rely on static variables, so they encourage library maintainers to receive the OpenTelemetry interface via a parameter at library initialization and use that to report metrics, logs, and traces.
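
To make the "parameter instead of static variable" idea concrete, here is a minimal sketch in plain Java. The Telemetry interface and the ConnectionPool class are hypothetical stand-ins invented for this illustration (they are not OTel types); in real code the injected object would be the OpenTelemetry API entry point.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the telemetry entry point a library receives.
// It is NOT an OpenTelemetry type; it only illustrates the injection pattern.
interface Telemetry {
    void count(String metricName, long value);
}

// Hypothetical library class: it receives Telemetry as a constructor
// parameter instead of reaching for a static, global instance.
class ConnectionPool {
    private final Telemetry telemetry;

    ConnectionPool(Telemetry telemetry) {
        this.telemetry = telemetry;
    }

    void borrowConnection() {
        telemetry.count("pool.connections.borrowed", 1);
    }
}

class InjectionSketch {
    // Wires the library to a recording Telemetry and returns the total it saw.
    static long borrowAndCount(int times) {
        AtomicLong total = new AtomicLong();
        Telemetry recording = (name, value) -> total.addAndGet(value);
        ConnectionPool pool = new ConnectionPool(recording);
        for (int i = 0; i < times; i++) {
            pool.borrowConnection();
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(borrowAndCount(3));
    }
}
```

Because the application owns the object it passes in, it can hand the library a real implementation in production and a recording or no-op one in tests, without any global state.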

The API

Before I explain what the API is used for and what it offers, let’s see a few concepts used in OTel.

Concepts

In OTel, Instruments are the entities through which you report measurements. Much like a real-life instrument, it is a measuring device, but since this is software, it takes the form of an object you use through its methods. The instrument’s methods allow you to report Measurements. For example, add 5 to http.request.body.lines, add -1 to processing.jobs.executing, and report 32 (milliseconds) to http.server.response.latency. The numbers are the measurements.

When you report a measurement to an instrument, you are most likely doing it for specific Attributes. For example, if you have an instrument named http.server.response.latency, you would report a specific response latency together with several attributes of the request, such as response status code and request method:

httpResponseLatency.record(32,
 Attributes.of(
   AttributeKey.longKey("statusCode"), 404L,
   AttributeKey.stringKey("method"), "GET"));

Attributes are key-value pairs of attribute name and attribute value.

Instruments are grouped into Meters, each having a name and a version. All instrument creation is done through a Meter. Your microservice will use one Meter for its own metrics, while your connection pool library will have its own Meter and define its instruments through it.

Instruments

Instruments have a name, like http.request.count, a description (will show up in UIs like Grafana), and a unit. The instruments offered by the API are:

  • Counter (DoubleCounter, LongCounter): an instrument that only increases and never decreases. Examples: HTTP request count, number of logins, etc.
  • UpDownCounter (DoubleUpDownCounter, LongUpDownCounter): an instrument that can increase or decrease. Examples: number of concurrently running background jobs, number of active connections, etc. Its value can be meaningfully summed across attributes, which is what sets it apart from a Gauge.
  • Gauge: an instrument registered only via a callback, a function that returns the gauge value. A gauge value cannot be meaningfully summed across attributes. Examples: temperature and CPU usage.
  • Histogram (DoubleHistogram, LongHistogram): used to collect measurements that are aggregated into statistically meaningful numbers. OTel supports Explicit Bucket Histograms and Exponential Bucket Histograms, while Summary is not supported (there is an issue addressing that). Unlike other well-known metric libraries, OTel has no separate interface per histogram kind (the Prometheus client, for comparison, has Summary for summaries and Histogram for explicit bucket histograms). Instead, when you initialize the SDK (the implementation), you configure which kind histograms use by default and can override that for specific histograms, i.e., decide whether each is an Explicit Bucket or Exponential Bucket histogram and specify its bucket list. I’ll describe that in the SDK section.

Here is a code example for defining instruments using the API only.

// Counter: the code calls add() on it whenever bytes are dispatched.
LongCounter bytesOutCounter = meter.counterBuilder("pulsar_bytes_out")
       .setDescription("Size of messages dispatched from this broker to consumers")
       .setUnit("bytes")
       .build();

// Gauge: the callback is invoked on each collection to read the current value.
meter.gaugeBuilder("room_temperature")
       .setUnit("celsius")
       .buildWithCallback(observableDoubleMeasurement ->
               observableDoubleMeasurement.record(
                     RoomManager.currentRoom().getTemperature(),
                     Attributes.of(
                             AttributeKey.stringKey("room"),
                             RoomManager.currentRoom().getName())));

// Histogram: note that no buckets are specified at this point.
DoubleHistogram httpResponseLatency = meter.histogramBuilder("http.response.latency")
       .setUnit("seconds")
       .setDescription("HTTP Response Latency")
       .build();
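
The semantic difference between Counter, UpDownCounter, and Gauge can be modeled in a few lines of plain Java. This is a toy sketch, not the SDK’s implementation: the class names ToyCounter, ToyUpDownCounter, and ToyGauge are made up for this illustration, and silently ignoring a negative increment is simply this sketch’s way of keeping the counter monotonic.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

class InstrumentKindsSketch {
    // Counter: monotonic, so negative increments are ignored in this sketch.
    static final class ToyCounter {
        private final AtomicLong sum = new AtomicLong();

        void add(long value) {
            if (value < 0) {
                return; // a counter never decreases
            }
            sum.addAndGet(value);
        }

        long value() {
            return sum.get();
        }
    }

    // UpDownCounter: accepts both positive and negative increments.
    static final class ToyUpDownCounter {
        private final AtomicLong sum = new AtomicLong();

        void add(long value) {
            sum.addAndGet(value);
        }

        long value() {
            return sum.get();
        }
    }

    // Gauge: there is no add(); the value is read from a callback on collection.
    static final class ToyGauge {
        private final LongSupplier callback;

        ToyGauge(LongSupplier callback) {
            this.callback = callback;
        }

        long collect() {
            return callback.getAsLong();
        }
    }

    static long counterDemo() {
        ToyCounter logins = new ToyCounter();
        logins.add(5);
        logins.add(-1); // ignored
        return logins.value();
    }

    static long upDownDemo() {
        ToyUpDownCounter runningJobs = new ToyUpDownCounter();
        runningJobs.add(5);
        runningJobs.add(-1);
        return runningJobs.value();
    }

    static long gaugeDemo() {
        long[] temperature = {21};
        ToyGauge gauge = new ToyGauge(() -> temperature[0]);
        temperature[0] = 23; // the gauge reports whatever is current at collection
        return gauge.collect();
    }

    public static void main(String[] args) {
        System.out.println(counterDemo() + " " + upDownDemo() + " " + gaugeDemo());
    }
}
```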

The SDK

As we explained before, the SDK is the implementation of the interfaces contained within the API: MeterProvider, Meter, and all the instruments described above. It also contains several other entities used for reading and exporting the metrics and configuring instruments further (override).

Before we explain Metric Reader, Metric Exporter, and Views, we first need to learn an important concept in OTel called Aggregations.

Aggregations

When you first learn OTel by reading or trying out its API, you soon run into this question: “I just defined a histogram, but I can’t find a way to define its buckets. How can that be?!”

meter.histogramBuilder("http.response.latency")
       .setUnit("seconds")
       .setDescription("HTTP Response Latency")
       .build();

You expected to find setBuckets(10, 100, 1000, 5000), but this method doesn’t exist. There is logic behind this, which is actually pretty amazing, yet there is also ongoing work to add such a method.

The basic idea in the SDK is that an instrument has an associated aggregation: an object you feed the measurements into, which decides how to aggregate them and what to output. For example, when you define a Counter, you normally have a Sum aggregation associated with it, adding the measurements you report (those +1, +3) into a sum variable. Upon collection, it emits the sum so far. Another example is the Explicit Bucket aggregation: when you report a measurement, it finds the matching bucket counter, increases it by 1, and increases a sum by the measurement. Upon collection, it emits the sum of the values, the count of the values, and, for each bucket, the number of reported values that fell within its boundaries.
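
The Explicit Bucket aggregation described above can be sketched in plain Java as follows. This is a toy model under stated assumptions, not the SDK’s code: the class name is made up, it is not thread-safe, and it treats each bucket’s upper boundary as inclusive, which is the convention the OTel specification uses.

```java
import java.util.Arrays;

class ExplicitBucketSketch {
    private final double[] boundaries; // sorted upper bounds of the buckets
    private final long[] bucketCounts; // one extra bucket for values above the last bound
    private double sum;
    private long count;

    ExplicitBucketSketch(double... boundaries) {
        this.boundaries = boundaries.clone();
        this.bucketCounts = new long[boundaries.length + 1];
    }

    // Feed one measurement: bump the matching bucket, the sum, and the count.
    void record(double value) {
        int i = 0;
        while (i < boundaries.length && value > boundaries[i]) {
            i++;
        }
        bucketCounts[i]++;
        sum += value;
        count++;
    }

    // What the aggregation emits on collection.
    String collect() {
        return "count=" + count + " sum=" + sum + " buckets=" + Arrays.toString(bucketCounts);
    }

    static String demo() {
        ExplicitBucketSketch latency = new ExplicitBucketSketch(10, 100, 1000, 5000);
        latency.record(5);
        latency.record(10);
        latency.record(50);
        latency.record(7000);
        return latency.collect();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // count=4 sum=7065.0 buckets=[2, 1, 0, 0, 1]
    }
}
```

A Sum aggregation is the degenerate case of the same shape: record() only adds to the sum, and collect() emits it.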

There are sensible default aggregations per instrument, like Sum for a Counter or Explicit Bucket for a Histogram. The latter also comes with a default list of bucket boundaries. OTel allows you to override the default aggregation and configure it per instrument using another concept called Views, which you configure upon initialization. That last part is exactly why people created the GitHub issue mentioned above: in some cases, it doesn’t make sense to split the definition of a histogram across two separate places in your code.
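
As a quick reference, the defaults can be written down as a lookup. This is a hedged sketch: the enum and method are invented for this illustration, the mapping and the boundary list follow the OpenTelemetry metrics SDK specification as I understand it, and you should verify both against the SDK version you use.

```java
class DefaultAggregationSketch {
    // Hypothetical enum covering the instrument kinds discussed in this post.
    enum InstrumentKind { COUNTER, UP_DOWN_COUNTER, GAUGE, HISTOGRAM }

    // Default explicit bucket boundaries as listed in the specification
    // (an assumption to double-check for your SDK version).
    static final double[] DEFAULT_BOUNDARIES = {
        0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000
    };

    // Maps an instrument kind to the name of its default aggregation.
    static String defaultAggregation(InstrumentKind kind) {
        switch (kind) {
            case COUNTER:
            case UP_DOWN_COUNTER:
                return "Sum";
            case GAUGE:
                return "Last Value";
            case HISTOGRAM:
                return "Explicit Bucket Histogram";
            default:
                throw new IllegalArgumentException("Unknown instrument kind: " + kind);
        }
    }

    public static void main(String[] args) {
        for (InstrumentKind kind : InstrumentKind.values()) {
            System.out.println(kind + " -> " + defaultAggregation(kind));
        }
    }
}
```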

Views

Views are the most powerful tool the OTel SDK offers, and they are a unique feature compared to all other metric libraries.

You can configure multiple views for an instrument. A view allows you to define an aggregation, configure it, and override the name, description, and unit. In essence, you create multiple metrics from the same original instrument. Think of it this way: when you define an instrument with a name, you define a way to report many numbers (measurements). A view takes all those measurements as input and uses its aggregation to create a metric, using the name, unit, and description defined in the view (anything not defined falls back to the instrument definition). So you can decide, for example, to take http.response.latency, which was defined as a histogram, and create 2 views for it:

  1. An explicit bucket histogram named http.response.latency, using the buckets (1, 10, 1000).
  2. A metric named http.response.latency.last that shows the last latency collected, defined with a Last Value aggregation (which keeps only the last measurement reported and emits it as a gauge).

If you only define a single view for an instrument, you just override the original definition and perhaps override the default aggregation and its default configuration.
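
The "one instrument, several views" idea can be modeled in plain Java like this. It is a toy sketch with made-up class names, meant only to show the fan-out of measurements to multiple aggregations; the real SDK has its own rules about which aggregations are compatible with which instrument types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class ViewFanOutSketch {
    // An aggregation consumes measurements and emits a result on collection.
    interface Aggregation {
        void record(double value);
        String collect();
    }

    static final class BucketHistogram implements Aggregation {
        private final double[] boundaries;
        private final long[] counts;

        BucketHistogram(double... boundaries) {
            this.boundaries = boundaries.clone();
            this.counts = new long[boundaries.length + 1];
        }

        @Override
        public void record(double value) {
            int i = 0;
            while (i < boundaries.length && value > boundaries[i]) {
                i++;
            }
            counts[i]++;
        }

        @Override
        public String collect() {
            return Arrays.toString(counts);
        }
    }

    static final class LastValue implements Aggregation {
        private double last = Double.NaN;

        @Override
        public void record(double value) {
            last = value;
        }

        @Override
        public String collect() {
            return String.valueOf(last);
        }
    }

    // Toy instrument: every measurement is fed to every registered view.
    static final class Instrument {
        private final Map<String, Aggregation> views = new LinkedHashMap<>();

        void addView(String metricName, Aggregation aggregation) {
            views.put(metricName, aggregation);
        }

        void record(double value) {
            for (Aggregation aggregation : views.values()) {
                aggregation.record(value);
            }
        }

        Map<String, String> collect() {
            Map<String, String> out = new LinkedHashMap<>();
            views.forEach((name, aggregation) -> out.put(name, aggregation.collect()));
            return out;
        }
    }

    static Map<String, String> demo() {
        Instrument latency = new Instrument();
        latency.addView("http.response.latency", new BucketHistogram(1, 10, 1000));
        latency.addView("http.response.latency.last", new LastValue());
        latency.record(3);
        latency.record(50);
        latency.record(7);
        return latency.collect();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```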

The second strength of views is that a single view can apply to multiple instruments. For example, you can say that all histogram instruments named “*latency” should have their aggregation set to Explicit Bucket Histogram with the buckets 10, 200, 3000. This is done through an Instrument Selector, which lets you choose multiple instruments based on the following:

  • name wildcard
  • instrument type
  • instrumentation scope (the name, version, and schema URL of the Meter that created the instrument)

For each instrument selected, the view defined will be added.

Here’s a code example:

SdkMeterProvider meterProvider = SdkMeterProvider.builder()
       // View 1: every histogram whose name ends with "latency" gets these buckets.
       .registerView(
               InstrumentSelector.builder()
                       .setName("*latency")
                       .setType(InstrumentType.HISTOGRAM)
                       .build(),
               View.builder()
                       .setAggregation(Aggregation.explicitBucketHistogram(List.of(10.0, 200.0, 3000.0)))
                       .build())
       // View 2: every histogram created by the "hikari" meter gets its own buckets.
       .registerView(
               InstrumentSelector.builder()
                       .setMeterName("hikari")
                       .setType(InstrumentType.HISTOGRAM)
                       .build(),
               View.builder()
                       .setAggregation(Aggregation.explicitBucketHistogram(List.of(2.0, 10.0, 50.0, 200.0)))
                       .build())
       .build();
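
To show what a selector does conceptually, here is a toy matcher in plain Java. It is a sketch under assumptions, not the SDK’s matching code: the method names are made up, only the * wildcard is handled, and a null criterion means "match anything".

```java
import java.util.regex.Pattern;

class SelectorSketch {
    // Turns a name glob into a regex: '*' matches any run of characters,
    // everything else is matched literally.
    static Pattern globToPattern(String glob) {
        String[] parts = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    // An instrument is selected only if it satisfies every non-null criterion.
    static boolean selects(String nameGlob, String type, String meterName,
                           String instrumentName, String instrumentType, String instrumentMeter) {
        boolean nameOk = nameGlob == null || globToPattern(nameGlob).matcher(instrumentName).matches();
        boolean typeOk = type == null || type.equals(instrumentType);
        boolean meterOk = meterName == null || meterName.equals(instrumentMeter);
        return nameOk && typeOk && meterOk;
    }

    public static void main(String[] args) {
        System.out.println(selects("*latency", "HISTOGRAM", null,
                "http.response.latency", "HISTOGRAM", "http-commons"));
        System.out.println(selects("*latency", "HISTOGRAM", null,
                "http.request.count", "COUNTER", "http-commons"));
    }
}
```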

Views provide a brilliant way to manipulate metrics you didn’t code yourself — coming from the libraries you use. You can decide whether a latency reported in the Hikari Connection Pool library will have buckets as you wish it to be (something you can’t do in other metric frameworks) or even drop it by setting the Drop aggregation for certain instruments of that library.

Finally, views also allow you to select only a subset of the reported attributes, thus achieving lower cardinality without losing data, since the measurements are rolled up to the attributes you keep.

Your HTTP client may have the following in its code:

var attr = Attributes.of(
        AttributeKey.longKey("statusCode"), requestStatusCode,
        AttributeKey.stringKey("method"), requestMethod);
httpRequestLatency.record(requestLatency, attr);

You can decide to keep only the statusCode attribute:

.registerView(
       InstrumentSelector.builder()
               .setMeterName("http-commons")
               .setName("http.request.latency")
               .build(),
       View.builder()
               .setAttributeFilter(attrName -> attrName.equals("statusCode"))
               .build())

In the implementation, when you report the value 30 with the attributes (statusCode=500, method=GET), the view reduces the attributes to (statusCode=500) and reports the value 30 for that set. You thereby achieve a roll-up from (statusCode, method) to statusCode for the instrument the view is configured for. Note that the roll-up happens within the scope of a single instrument, not across multiple instruments.
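
The roll-up can be modeled in plain Java as follows. This is a toy sketch, not the SDK’s implementation: the class is made up, attribute values are simplified to strings, and a simple sum stands in for the aggregation (a latency histogram would merge bucket counts in the same way).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class AttributeRollUpSketch {
    private final Predicate<String> keyFilter;
    // One aggregated value per distinct set of kept attributes.
    private final Map<Map<String, String>, Long> sums = new HashMap<>();

    AttributeRollUpSketch(Predicate<String> keyFilter) {
        this.keyFilter = keyFilter;
    }

    // Drops the filtered-out attributes, then aggregates under the reduced set.
    void record(long value, Map<String, String> attributes) {
        Map<String, String> kept = new TreeMap<>();
        attributes.forEach((key, attrValue) -> {
            if (keyFilter.test(key)) {
                kept.put(key, attrValue);
            }
        });
        sums.merge(kept, value, Long::sum);
    }

    long sumFor(Map<String, String> attributes) {
        return sums.getOrDefault(attributes, 0L);
    }

    int seriesCount() {
        return sums.size();
    }

    static AttributeRollUpSketch demo() {
        AttributeRollUpSketch view = new AttributeRollUpSketch(name -> name.equals("statusCode"));
        view.record(30, Map.of("statusCode", "500", "method", "GET"));
        view.record(20, Map.of("statusCode", "500", "method", "POST"));
        view.record(5, Map.of("statusCode", "200", "method", "GET"));
        return view;
    }

    public static void main(String[] args) {
        AttributeRollUpSketch view = demo();
        System.out.println(view.seriesCount() + " series, statusCode=500 -> "
                + view.sumFor(Map.of("statusCode", "500")));
    }
}
```

Three distinct attribute combinations went in, but only two series come out, and no measurement was lost: the two statusCode=500 values were merged into one.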

Metric Reader and Exporter

When you initialize the SDK, you can (and should) provide a Metric Reader. It’s the component that reads the metrics from the SDK and uses a Metric Exporter to send them out, either via a pull mechanism (like exposing an HTTP endpoint that responds with the metrics in a certain format) or via a push mechanism that periodically hands the metrics to the exporter (for example, writing them in the OTLP protocol to the OpenTelemetry Collector).

Some Metric Readers come with a bundled exporter, like the Prometheus Metric Exporter. Others, like the Periodic Metric Reader, require you to pass an exporter when creating them, for example an OTLP gRPC exporter or an OTLP HTTP exporter.
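
The division of labor between reader and exporter, and the push versus pull distinction, can be sketched in plain Java. All names here are invented for the illustration and none of them are SDK classes. In a real push reader a scheduler would call tick() periodically; the sketch calls it by hand to stay deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class ReaderExporterSketch {
    // Exporter: knows how to ship one batch of metrics somewhere.
    interface Exporter {
        void export(Map<String, Long> batch);
    }

    // Toy exporter that just remembers what it was asked to export.
    static final class InMemoryExporter implements Exporter {
        final List<Map<String, Long>> exported = new ArrayList<>();

        @Override
        public void export(Map<String, Long> batch) {
            exported.add(batch);
        }
    }

    // Push-style reader: on every tick it collects and hands a snapshot to the exporter.
    static final class PushReader {
        private final Supplier<Map<String, Long>> collector;
        private final Exporter exporter;

        PushReader(Supplier<Map<String, Long>> collector, Exporter exporter) {
            this.collector = collector;
            this.exporter = exporter;
        }

        void tick() {
            exporter.export(new LinkedHashMap<>(collector.get()));
        }
    }

    // Pull-style reader: collects only when someone asks, e.g. on an HTTP scrape.
    static final class PullReader {
        private final Supplier<Map<String, Long>> collector;

        PullReader(Supplier<Map<String, Long>> collector) {
            this.collector = collector;
        }

        String scrape() {
            StringBuilder body = new StringBuilder();
            collector.get().forEach((name, value) ->
                    body.append(name).append(' ').append(value).append('\n'));
            return body.toString();
        }
    }

    static String pushDemo() {
        Map<String, Long> state = new LinkedHashMap<>();
        InMemoryExporter exporter = new InMemoryExporter();
        PushReader reader = new PushReader(() -> state, exporter);
        state.put("requests", 3L);
        reader.tick();
        state.put("requests", 7L);
        reader.tick();
        return exporter.exported.get(0).get("requests") + "," + exporter.exported.get(1).get("requests");
    }

    static String pullDemo() {
        Map<String, Long> state = new LinkedHashMap<>();
        state.put("requests", 7L);
        return new PullReader(() -> state).scrape();
    }

    public static void main(String[] args) {
        System.out.println(pushDemo());
        System.out.print(pullDemo());
    }
}
```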

Summary

OTel is, in my opinion, the best metric library created for the JVM. Its designers seem to have thought of everything and managed to design it with elegance. Using specifications to make all SDKs look the same is brilliant, as it makes moving between languages a breeze, and so is pairing it with an external collector capable of modifying telemetry, keeping state, and exporting to all the destinations needed. The only downside of OTel is the documentation: it takes a few days at the very least to understand how it works and how to use it, and I hope it will improve in time. This blog post’s goal was to explain it briefly, so that in 10–20 minutes of reading you’ll understand its basic workings.

I haven’t touched on all the aspects of OTel Metrics; I will leave them to future blog posts. I believe OTel will revolutionize metrics frameworks on the JVM, just as Docker and Maven did in their respective domains.

Asaf Mesika
Asaf Mesika is a Principal Engineer at StreamNative. He combines his passion for clean code, 22 years of experience and appreciation for great team work to build a truly outstanding open-source based event streaming platform. Asaf previously worked at Logz.io, building the core foundations for Logz.io platform. Asaf is also the co-founder of Java.IL, the Israeli Java User Group, fostering a thriving community since 2010, and co-founder of Tech Leads IL - the leading community for Tech Leads in Israel.
