Monitoring Go Applications With Prometheus

Scot Wells

Monitoring service-level metrics gives your team greater visibility into how your application performs and how it is being used, and helps identify potential performance bottlenecks.

Prometheus is an open source monitoring solution built with native service discovery support, making it a perfect candidate for monitoring services in a dynamic environment. Prometheus can discover scrape targets from AWS, Kubernetes, Consul, and more!

When instrumenting a service with Prometheus, there are two typical approaches: embedding the metrics in the service itself by exposing a /metrics endpoint on an HTTP server, or building a stand-alone process, known as an exporter, that exposes the metrics on the service's behalf.

In this guide, we will walk through how to integrate Prometheus into a Go-based service using the official Go client. Check out this full working example of adding metrics to a worker-based Go service.


Getting Started

The official Prometheus client provides a robust instrumentation library written in Golang that can be used to register, collect, and expose service metrics. Before we cover exposing metrics in an application, let's explore the different metric types provided by the Prometheus libraries.
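The snippets throughout this guide assume the official client packages are imported. A minimal import block, using the published paths from github.com/prometheus/client_golang along with the standard library packages used below, looks like this:

import (
  "log"
  "net/http"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)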

Metric Types

Prometheus clients expose four core metric types that can be utilized when exposing service metrics. Check out the Prometheus docs for more in-depth information on the different metric types.

Counter

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

Gauge

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also “counts” that can go up and down, like the number of running goroutines or the number of in-flight requests.

Histogram

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

Summary

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
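As a quick reference, here is a minimal sketch of how each of these types can be declared with the Go client. The metric names are purely illustrative and are not part of the worker example used later in this post:

var (
  // Counter: only ever goes up (or resets to zero on restart)
  requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "myapp_requests_total",
    Help: "Total number of requests handled.",
  })
  // Gauge: can go up and down
  inflightRequests = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "myapp_inflight_requests",
    Help: "Number of requests currently being handled.",
  })
  // Histogram: counts observations into buckets and keeps a running sum
  requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "myapp_request_duration_seconds",
    Help:    "Time spent handling requests.",
    Buckets: prometheus.DefBuckets,
  })
  // Summary: reports configurable quantiles over a sliding window
  requestLatency = prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "myapp_request_latency_seconds",
    Help:       "Time spent handling requests.",
    Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
  })
)

Each metric still needs to be registered (for example with prometheus.MustRegister) before it will show up on the /metrics endpoint, which we cover next.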

The Prometheus HTTP Server

The first step to integrating Prometheus metrics into your service is to initialize an HTTP server to serve the metrics. This server should listen on an internal port that is only available to your infrastructure, typically in the 9xxx range. The Prometheus team maintains a list of default port allocations you can reference when choosing a port.

// create a new mux server
server := http.NewServeMux()
// register a new handler for the /metrics endpoint
server.Handle("/metrics", promhttp.Handler())
// start an http server using the mux server, and exit if it fails
log.Fatal(http.ListenAndServe(":9001", server))

This will create a new HTTP server running on port :9001 that will expose the metrics in the format that Prometheus expects. After starting the HTTP server, try running curl localhost:9001/metrics. You should see metrics in the following format.

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 5

Exposing Service Metrics

For this example, we'll be adding Prometheus stats to a queue system that processes background jobs. To simulate jobs with varying execution times, each job sleeps for a random interval. Each worker is configured to print a log line for each job it processes.

func main() {
  ...
  // create a channel with a 10,000 Job buffer
  jobChannel := make(chan *Job, 10000)
  // start the job processor
  go startJobProcessor(jobChannel)
  // start a goroutine to create some mock jobs
  go createJobs(jobChannel)
  ...
}

// startWorker creates a new worker that processes jobs from a job channel
func startWorker(workerID string, jobs <-chan *Job) {
  for {
    select {
    // read from the job channel
    case job := <-jobs:
      log.Printf(
        "[%s] Processing job with worker %s\n",
        time.Now().String(),
        workerID,
      )
      // fake processing the request
      time.Sleep(job.Sleep)
    }
  }
}
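The Job type and the createJobs/startJobProcessor helpers referenced above come from the full example linked earlier; for completeness, here is a rough sketch of what they might look like (assuming math/rand and strconv are imported). Everything beyond the Sleep and Type fields used by the worker, such as the worker count, job types, and timings, is an assumption made for illustration:

// Job represents a single unit of background work
type Job struct {
  Type  string        // the kind of job, e.g. "email" or "activation"
  Sleep time.Duration // how long the fake job takes to "process"
}

// startJobProcessor starts a small pool of workers reading from the job channel
func startJobProcessor(jobs <-chan *Job) {
  for i := 1; i <= 2; i++ {
    go startWorker(strconv.Itoa(i), jobs)
  }
}

// createJobs pushes mock jobs onto the channel forever
func createJobs(jobs chan<- *Job) {
  types := []string{"email", "activation", "transaction"}
  for {
    jobs <- &Job{
      Type:  types[rand.Intn(len(types))],
      Sleep: time.Duration(rand.Intn(1000)) * time.Millisecond,
    }
    time.Sleep(50 * time.Millisecond)
  }
}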

Try executing the application and see if you can determine the number of jobs being processed, the number of jobs waiting to be processed, or the amount of time spent processing jobs. Then try to work out what those statistics looked like over time. Obviously, we could record that information in log lines, ship those logs off to an ELK cluster, and call it a day. But there is a trade-off between metrics and logs.

Metrics tend to have lower overhead than logs because they are cheap to store and transfer. So how do we modify our service to add Prometheus stats? The first thing we need to do is modify our code to create the Prometheus metrics we want to capture.

Let's focus on capturing three data points: the number of jobs we've processed, the number of jobs waiting to be processed, and the average time it takes to process a job.

Adding Service Metrics

First, let's focus on capturing the total number of jobs that have been processed by our workers. By labeling the counter with the worker ID, this metric will also let us see how many jobs each individual worker has processed. Once you've registered the counter, modify the worker function to increment it for every job that is processed.

var (
  totalCounterVec = prometheus.NewCounterVec(
    prometheus.CounterOpts{
      Namespace: "worker",
      Subsystem: "jobs",
      Name: "processed_total",
      Help: "Total number of jobs processed by the workers",
    },
    // We will want to monitor the worker ID that processed the
    // job, and the type of job that was processed
    []string{"worker_id", "type"},
  )
)

func init() {
  ...
  // register with the prometheus collector
  prometheus.MustRegister(totalCounterVec)
  ...
}

func startWorker(workerID string, jobs <-chan *Job) {
  for {
    select {
    case job := <-jobs:
      ...
      totalCounterVec.WithLabelValues(workerID, job.Type).Inc()
      ...
    }
  }
}

Once the service has been updated, run it again and query the Prometheus endpoint. You should see new metrics in the output that capture the number of jobs that have been processed by a given worker and for a given type. The output should look similar to the following.

# HELP worker_jobs_processed_total Total number of jobs processed by the workers
# TYPE worker_jobs_processed_total counter
worker_jobs_processed_total{type="activation",      worker_id="1"} 22
worker_jobs_processed_total{type="activation",      worker_id="2"} 16
worker_jobs_processed_total{type="customer_renew",  worker_id="1"} 1
worker_jobs_processed_total{type="deactivation",    worker_id="2"} 22
worker_jobs_processed_total{type="email",           worker_id="1"} 20
worker_jobs_processed_total{type="order_processed", worker_id="2"} 13
worker_jobs_processed_total{type="transaction",     worker_id="1"} 16

Next, try updating the worker to capture the number of in-flight jobs (Hint: use a Gauge 😉) and the amount of time a worker spends processing a job (Hint: use a Histogram 😉).
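If you want to check your work against something, here is one possible sketch. The metric names line up with the queries used later in this post, but the label names, bucket choices, and where the gauge is incremented are assumptions rather than a canonical solution. Note that the histogram is labeled with worker so that it matches the histogram_quantile() queries below:

var (
  // worker_jobs_inflight - jobs that have been queued but not yet processed
  inflightGauge = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
      Namespace: "worker",
      Subsystem: "jobs",
      Name:      "inflight",
      Help:      "Number of jobs waiting to be processed",
    },
    []string{"type"},
  )
  // worker_jobs_process_time_seconds - how long each job took to process
  processTimeVec = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Namespace: "worker",
      Subsystem: "jobs",
      Name:      "process_time_seconds",
      Help:      "Amount of time spent processing a job",
      Buckets:   prometheus.DefBuckets,
    },
    []string{"worker", "type"},
  )
)

func init() {
  prometheus.MustRegister(inflightGauge, processTimeVec)
}

// in createJobs: increment the gauge as each job is queued
//   inflightGauge.WithLabelValues(job.Type).Inc()
// in startWorker: decrement the gauge and time the work as each job is picked up
//   inflightGauge.WithLabelValues(job.Type).Dec()
//   start := time.Now()
//   time.Sleep(job.Sleep)
//   processTimeVec.WithLabelValues(workerID, job.Type).Observe(time.Since(start).Seconds())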


Analyzing the Data

Before we are able to analyze the metrics exposed by the service, we need to configure Prometheus to pull metrics from the service.

Setting up Prometheus

Now that we've updated the service to expose Prometheus metrics, we need to configure Prometheus to pull the metrics from our service. To do that, we will set up a new Prometheus scrape configuration that points at the service. For more information on the scrape configuration, check out the Prometheus documentation.

scrape_configs:
  - job_name: 'demo'
    # scrape the service every second
    scrape_interval: 1s
    # setup the static configs
    static_configs:
      - targets: ['docker.for.mac.localhost:9001']

Next, start the Prometheus server to begin capturing the metrics exposed by the service. You should be able to start Prometheus with the following Docker Compose service configuration.

services:
  prometheus:
    image: 'prom/prometheus:latest'
    ports:
    - '9090:9090'
    volumes:
    - './prometheus.yml:/etc/prometheus/prometheus.yml'

Querying the Data

Note: For more information on querying Prometheus, check out the querying documentation.

Now that Prometheus is scraping our service endpoint, you can use the Prometheus Query Language to ask meaningful questions about your application. For example, one important metric is the number of jobs our workers are processing per second. We can generate this using the rate() function. The following query returns the average number of jobs processed per second, broken down by job type, over a 5-minute window.

sum by (type) (rate(worker_jobs_processed_total[5m]))

Another useful metric for this service is the rate at which jobs are being added to the queue. Since the in-flight jobs metric uses a Gauge, we can use the deriv() function to calculate the per-second rate of change in the number of pending jobs. This metric can help you determine whether you have enough workers running to handle your current job volume.

sum by (type) (deriv(worker_jobs_inflight[5m]))

Another useful metric we can calculate is the average amount of time a worker takes to process its jobs. For this metric, we will use the rate() function to compare the seconds spent processing jobs with the number of jobs that were completed.

sum(rate(worker_jobs_process_time_seconds_sum[5m]))
/
sum(rate(worker_jobs_process_time_seconds_count[5m]))

Since the worker_jobs_process_time_seconds metric is a Histogram, we can use the histogram_quantile() function to show the 50th, 95th, and 100th percentiles of the time taken for a worker to finish its jobs. This gives us better visibility into the distribution of processing time between workers. Note that histogram_quantile() relies on the le label to work properly, so le must be included in any aggregations. (Huge thanks to @jwenz723 for these example queries!)

50th Percentile

histogram_quantile(
  0.5,
  sum by (worker, le) (rate(worker_jobs_process_time_seconds_bucket[5m]))
)

95th Percentile

histogram_quantile(
  0.95,
  sum by (worker, le) (rate(worker_jobs_process_time_seconds_bucket[5m]))
)

100th Percentile

histogram_quantile(
  1,
  sum by (worker, le) (rate(worker_jobs_process_time_seconds_bucket[5m]))
)

Lastly, I would recommend setting up Grafana to query your Prometheus server for metrics. Grafana is an amazing open-source graphing solution that can help you turn your Prometheus statistics into beautiful operational dashboards. Here are some of the dashboards created from this walk-through.

Example Grafana dashboard using metrics from the blog post

Check out the demo Grafana dashboard in this example on adding Prometheus metrics to your Golang service for more examples.


Questions or Feedback? Comment below or reach out on twitter!

