
Configure Alertmanager

Alertmanager

Alertmanager is a component of the Control Center module in StreamNative Platform and a component of Prometheus. Alertmanager handles alerts sent by StreamNative components, such as the Prometheus server. It deduplicates, groups, and routes them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

Configure

You can configure Alertmanager on StreamNative Platform with an individual component configuration file. By default, the configuration file is located at ${PLATFORM_HOME}/share/sn-alert-manager/alertmanager.yml.
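
The following is a minimal sketch of what such a configuration file typically contains: a route and at least one receiver. The receiver name, grouping settings, and webhook URL below are placeholders for illustration, not StreamNative Platform defaults.

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'default-receiver'            # placeholder receiver name

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/alerts'   # placeholder webhook endpoint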

Example 1

How to configure Alertmanager in Prometheus

This example shows how to configure Prometheus to communicate with Alertmanager. Add the following configuration to the ${PLATFORM_HOME}/etc/sn-prometheus/prometheus.yml file and replace the targets with your Alertmanager address.

Note: By default, the ${PLATFORM_HOME}/etc/sn-prometheus/prometheus.yml file does not exist. For how to generate the file, see here.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['127.0.0.1:9093']

Example 2

How to configure Alertmanager for Slack

This example explains how to configure Alertmanager to send alert notifications to Slack.

global:
  resolve_timeout: 1m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'slack_api_url'

route:
  group_interval: 1m
  repeat_interval: 10m
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
    - channel: '#alerts'
      send_resolved: true
      icon_url: https://avatars3.githubusercontent.com/u/3380462
      title: |-
        [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
        {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
          {{" "}}(
          {{- with .CommonLabels.Remove .GroupLabels.Names }}
            {{- range $index, $label := .SortedPairs -}}
              {{ if $index }}, {{ end }}
              {{- $label.Name }}="{{ $label.Value -}}"
            {{- end }}
          {{- end -}}
          )
        {{- end }}
      text: >-
        {{ range .Alerts -}}
        *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

        *Description:* {{ .Annotations.description }}

        *Details:*
          {{ range .Labels.SortedPairs }} - *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}

Start

Use the following command to start Alertmanager after you finish the configuration.

${PLATFORM_HOME}/bin/sn-alertmanager

Once Alertmanager has started successfully, it listens on port 9093 by default, which is the address configured in the Prometheus alerting section above.
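
To verify that it is running, you can query the standard Alertmanager readiness endpoint, assuming the default 127.0.0.1:9093 address used in the examples above:

# Returns HTTP 200 when Alertmanager is ready to serve requests
curl http://127.0.0.1:9093/-/ready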

Alert rules

Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service. Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements' label sets.

For more information about alert rules, see here.
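
Note that alert rule groups, such as the example below, are evaluated by Prometheus rather than by Alertmanager. Assuming you save the rules to a file such as rules.yml (a hypothetical file name), you would reference it from the rule_files section of ${PLATFORM_HOME}/etc/sn-prometheus/prometheus.yml:

rule_files:
  # The path is an assumption; point this at wherever you store the rule file
  - 'rules.yml'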

Example: How to configure alert rules with Pulsar

This example shows how to configure alert rules with Pulsar.

groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          status: danger
        annotations:
          summary: "Instance {{ $labels.instance }} down."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 60
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High cpu usage."
          description: "High cpu usage on instance {{ $labels.instance }} of job {{ $labels.job }} over than 60%, current value is {{ $value }}"

      - alert: HighIOUtils
        expr: irate(node_disk_io_time_seconds_total[1m]) > 0.6
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High IO utils."
          description: "High IO utils on instance {{ $labels.instance }} of job {{ $labels.job }} over than 60%, current value is {{ $value }}%"

      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes)  / node_filesystem_size_bytes > 0.8
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High disk usage"
          description: "High IO utils on instance {{ $labels.instance }} of job {{ $labels.job }} over than 60%, current value is {{ $value }}%"

      - alert: HighInboundNetwork
        expr: irate(node_network_receive_bytes_total{device!="lo"}[5m]) / 1024 / 1024 > 512
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "High inbound network"
          description: "High inbound network on instance {{ $labels.instance }} of job {{ $labels.job }} over than 512MB/s, current value is {{ $value }}/s"

  - name: zookeeper
    rules:
      - alert: HighWatchers
        expr: zookeeper_server_watches_count{job="zookeeper"} > 1000000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Watchers of Zookeeper server is over than 1000k."
          description: "Watchers of Zookeeper server {{ $labels.instance }} is over than 1000k, current value is {{ $value }}."

      - alert: HighEphemerals
        expr: zookeeper_server_ephemerals_count{job="zookeeper"} > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Ephemeral nodes of Zookeeper server is over than 10k."
          description: "Ephemeral nodes of Zookeeper server {{ $labels.instance }} is over than 10k, current value is {{ $value }}."

      - alert: HighConnections
        expr: zookeeper_server_connections{job="zookeeper"} > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Connections of Zookeeper server is over than 10k."
          description: "Connections of Zookeeper server {{ $labels.instance }} is over than 10k, current value is {{ $value }}."

      - alert: HighDataSize
        expr: zookeeper_server_data_size_bytes{job="zookeeper"} > 107374182400
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Data size of Zookeeper server is over than 100TB."
          description: "Data size of Zookeeper server {{ $labels.instance }} is over than 100TB, current value is {{ $value }}."

      - alert: HighRequestThroughput
        expr: sum(irate(zookeeper_server_requests{job="zookeeper"}[30s])) by (type) > 1000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Request throughput on Zookeeper server is over than 1000 in 30 seconds."
          description: "Request throughput of {{ $labels.type}} on Zookeeper server {{ $labels.instance }} is over than 1k, current value is {{ $value }}."

      - alert: HighRequestLatency
        expr: zookeeper_server_requests_latency_ms{job="zookeeper", quantile="0.99"} > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Request latency on Zookeeper server is over than 100ms."
          description: "Request latency {{ $labels.type }} in p99 on Zookeeper server {{ $labels.instance }} is over than 100ms, current value is {{ $value }} ms."

  - name: bookie
    rules:
      - alert: HighEntryAddLatency
        expr: bookkeeper_server_ADD_ENTRY_REQUEST{job="bookie", quantile="0.99", success="true"} > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Entry add latency is over than 100ms"
          description: "Entry add latency on bookie {{ $labels.instance }} is over than 100ms, current value is {{ $value }}."

      - alert: HighEntryReadLatency
        expr: bookkeeper_server_READ_ENTRY_REQUEST{job="bookie", quantile="0.99", success="true"} > 1000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Entry read latency is over than 1s"
          description: "Entry read latency on bookie {{ $labels.instance }} is over than 1s, current value is {{ $value }}."

  - name: broker
    rules:
      - alert: StorageWriteLatencyOverflow
        expr: pulsar_storage_write_latency{job="broker"} > 1000
        for: 30s
        labels:
          status: danger
        annotations:
          summary: "Topic write data to storage latency overflow is more than 1000."
          description: "Topic {{ $labels.topic }} is more than 1000 messages write to storage latency overflow , current value is {{ $value }}."

      - alert: TooManyTopics
        expr: sum(pulsar_topics_count{job="broker"}) by (cluster) > 1000000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Topic count are over than 1000000."
          description: "Topic count in cluster {{ $labels.cluster }} is more than 1000000 , current value is {{ $value }}."

      - alert: TooManyProducersOnTopic
        expr: pulsar_producers_count > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Producers on topic are more than 10000."
          description: "Producers on topic {{ $labels.topic }} is more than 10000 , current value is {{ $value }}."

      - alert: TooManySubscriptionsOnTopic
        expr: pulsar_subscriptions_count > 100
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Subscriptions on topic are more than 100."
          description: "Subscriptions on topic {{ $labels.topic }} is more than 100 , current value is {{ $value }}."

      - alert: TooManyConsumersOnTopic
        expr: pulsar_consumers_count > 10000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Consumers on topic are more than 10000."
          description: "Consumers on topic {{ $labels.topic }} is more than 10000 , current value is {{ $value }}."

      - alert: TooManyBacklogsOnTopic
        expr: pulsar_msg_backlog > 50000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Backlogs of topic are more than 50000."
          description: "Backlogs of topic {{ $labels.topic }} is more than 50000 , current value is {{ $value }}."

      - alert: TooManyGeoBacklogsOnTopic
        expr: pulsar_replication_backlog > 50000
        for: 30s
        labels:
          status: warning
        annotations:
          summary: "Geo backlogs of topic are more than 50000."
          description: "Geo backlogs of topic {{ $labels.topic }} is more than 50000, current value is {{ $value }}."