
The openmetrics R package now supports pushing metrics to a Prometheus Pushgateway instance, which is useful for short-lived batch scripts or RMarkdown reports. You might want to expose metrics from these scripts or reports to Prometheus in order to improve monitoring and alerting on failures, but many of these processes are not around long enough to run a webserver that Prometheus can pull from.

This is where the Pushgateway comes in. It allows you to push metrics to a centralised location where they can be aggregated and then scraped by Prometheus itself. But beware: there are a limited number of use cases for pushing metrics, and you should always prefer pull-based methods when possible.
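As a rough sketch of how this might look in a batch script (the push_to_gateway() call, its arguments, and the metric name below reflect my reading of the package’s API and are illustrative rather than definitive):
library(openmetrics)

# Record something about this run of the batch job.
duration <- gauge_metric("batch_duration_seconds", "Time taken by the nightly batch job.")
start <- Sys.time()
# ... the actual work of the script goes here ...
duration$set(as.numeric(Sys.time() - start, units = "secs"))

# Push all registered metrics to a Pushgateway before the process exits.
# The URL and job name are placeholders for your own setup.
push_to_gateway("http://localhost:9091", job = "nightly-batch")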
Continue Reading →
Grafana sports a feature called Annotations that allows you to label a timestamp on a dashboard with meaningful events – most commonly deployments, campaigns, or outages:

(In this case annotating the simulated deployment of a Fluent Bit container, which I’ve used to forward container logs out of the cluster.)

Annotations can be input manually, but the only recommendations I’ve seen for generating them automatically are to use something like Loki, or to teach your CI/CD system to interact with Grafana’s web API. However, if you’re running a simple Prometheus + Grafana stack (say, using the Prometheus Operator on Kubernetes), you might be reluctant to add more complexity to your setup just to get deployment annotations.

Fortunately, there’s a simpler alternative for this narrow case: you can use the process_start_time_seconds metric from Prometheus to get an approximate idea of when apps or pods were started. I haven’t seen this approach recommended elsewhere, hence this post.
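For instance, a Grafana annotation query along the following lines (this PromQL is an illustrative sketch rather than a recipe from any particular dashboard, and the job label is a placeholder) fires whenever a process’s reported start time changes, i.e. whenever it is (re)deployed:
changes(process_start_time_seconds{job="my-app"}[2m]) > 0
The changes() function counts how often the series’ value changed over the window, so the expression is non-zero only around restarts.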
Continue Reading →
My openmetrics package is now available on CRAN. The package makes it possible to add predefined and custom “metrics” to any R web application and expose them on a /metrics endpoint, where they can be consumed by Prometheus.

Prometheus itself is a hugely popular, open-source monitoring and metrics aggregation tool that is widely used in the Kubernetes ecosystem, usually alongside Grafana for visualisation.

To illustrate, the following is a real Grafana dashboard built from the default metrics exposed by the package for Plumber APIs:

Adding these to an existing Plumber API is extremely simple:
library(openmetrics)
srv <- plumber::plumb("plumber.R")
srv <- register_plumber_metrics(srv)
srv$run()
There is also built-in support for Shiny:
app <- shiny::shinyApp(...)
app <- register_shiny_metrics(app)
app
openmetrics is designed to be “batteries included” and offer good built-in
metrics for existing applications, but it is also possible (and encouraged!) to
add custom metrics tailored to your needs, and to expose them to Prometheus even
if you are not using Plumber or Shiny.
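For example, defining and rendering a couple of custom metrics might look something like this (the metric names are invented for illustration, and the calls reflect my understanding of the package’s counter and gauge API):
library(openmetrics)

# A counter only ever goes up, e.g. for counting events.
meows <- counter_metric("meows_total", "Total number of meows heard.")
meows$inc()   # Count one meow.
meows$inc(3)  # Count three more.

# A gauge can go up or down, e.g. for a current reading.
thermostat <- gauge_metric("thermostat_celsius", "Current thermostat reading.")
thermostat$set(21.3)
thermostat$dec(2)

# Render all registered metrics in the text format Prometheus expects,
# e.g. to serve from a /metrics route in another web framework.
render_metrics()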
More detailed usage information is available in the package’s README.
Continue Reading →
The Plumber package is a popular way to make R
models or other code accessible to others with an HTTP API. It’s easy to get
started using Plumber, but it’s not always clear what to do after you have a
basic API up and running.
This post shares three simple endpoints I’ve used on dozens of Plumber APIs to make them easier to debug and deploy in development and production environments: /_ping, /_version, and /_sessioninfo.
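To give a flavour of what these look like, here is a rough sketch of how such endpoints might be written in a plumber.R file; the implementations described in the post itself may differ, and “mypackage” is a stand-in for whatever package backs your API:
#* Liveness check: returns a fixed response so load balancers and
#* Kubernetes probes can tell that the API process is up.
#* @get /_ping
#* @serializer unboxedJSON
function() {
  "OK"
}

#* Report which version of the underlying package is deployed.
#* @get /_version
#* @serializer unboxedJSON
function() {
  as.character(utils::packageVersion("mypackage"))
}

#* Dump sessionInfo() output to help debug dependency mismatches.
#* @get /_sessioninfo
#* @serializer text
function() {
  paste(utils::capture.output(utils::sessionInfo()), collapse = "\n")
}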
Continue Reading →
Each quarter the backup service Backblaze publishes data on the failure rate of its hundreds of thousands of hard drives, most recently on February 11th. Since the failure rate of different models can vary widely, these posts sometimes make a splash in the tech community. They’re also notable as the source of the only large public dataset on drive failures:

One of the things that strikes me about the presentation above is that Backblaze uses simple averages to compute the “Annualized Failure Rate” (AFR), despite the fact that the actual count data vary by orders of magnitude, down to a single digit. This might lead us to question the accuracy for smaller samples; in fact, the authors are sensitive to this possibility and suppress data from drives with fewer than 5,000 days of operation in Q4 2019 (although they are detailed in the text of the article and available in their public datasets).

This looks like a perfect use case for a Bayesian approach: we want to combine a prior expectation of the failure rate (which might be close to the historical average across all drives) with observed failure events to produce a more accurate estimate for each model.
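As a sketch of what this looks like concretely (the prior parameters and the failure and drive-day figures below are invented for illustration, not Backblaze’s numbers), a conjugate Gamma-Poisson model gives the posterior in closed form:
# Hypothetical figures for a single drive model.
failures   <- 2
drive_days <- 8000

# Gamma prior on the daily failure rate, loosely centred on a ~1.5% AFR
# (i.e. 0.015 / 365 failures per drive-day); the shape value is an assumption.
prior_shape <- 2
prior_rate  <- prior_shape / (0.015 / 365)

# With failures ~ Poisson(rate * drive_days), the Gamma prior is conjugate,
# so the posterior is Gamma(prior_shape + failures, prior_rate + drive_days).
post_shape <- prior_shape + failures
post_rate  <- prior_rate + drive_days

# Posterior mean and 95% credible interval for the *annualized* failure rate.
365 * post_shape / post_rate
365 * qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)
Models with many drive-days are dominated by their own data, while small samples get pulled toward the prior, which is exactly the behaviour we want here.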
Continue Reading →